Abstract. We consider the problem of parameter estimation for a system of
ordinary differential equations from noisy observations on a solution of the
system. In case the system is nonlinear, as it typically is in practical
applications, an analytic solution to it usually does not exist. Consequently,
straightforward estimation methods like the ordinary least squares method depend
on repetitive use of numerical integration in order to determine the solution of
the system for each of the parameter values considered, and to find subsequently
the parameter estimate that minimises the objective function. This induces a
huge computational load to such estimation methods. We study the asymptotic
consistency of an alternative estimator that is defined as a minimiser of an
appropriate distance between a nonparametrically estimated derivative of the
solution and the right-hand side of the system applied to a nonparametrically
estimated solution. This smooth and match estimator (SME) bypasses numerical
integration altogether and reduces the amount of computational time drastically
compared to ordinary least squares. Moreover, we show that under suitable
regularity conditions this smooth and match estimation procedure leads to a
$\sqrt{n}$-consistent estimator of the parameter of interest.

1. Brief Introduction

Many dynamical systems in science and applications are modelled by a
$d$-dimensional system of ordinary differential equations, denoted as

$$x'(t) = F(x(t), \theta), \quad t \in [0, 1], \qquad x(0) = \xi, \tag{1}$$

where $\theta$ is the unknown parameter of interest and $\xi$ is the initial
condition. With $x_\theta(t)$ the solution vector corresponding to the parameter
value $\theta$, we observe

$$Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}, \quad i = 1, \dots, n, \quad j = 1, \dots, d,$$

where the observation times $0 \le t_1 < \dots < t_n \le 1$ are known and the
random variables $\epsilon_{ij}$ have mean zero and model measurement errors
combined with latent random deviations from the idealised model (1). Under
regularity conditions the ordinary least squares estimator

$$\tilde\theta_n = \operatorname{argmin}_{\eta} \sum_{i=1}^{n} \sum_{j=1}^{d} \left( Y_{ij} - x_{\eta j}(t_i) \right)^2 \tag{2}$$

is $\sqrt{n}$-consistent, at least theoretically. For systems (1) that do not
have explicit solutions, one typically uses iterative procedures to approximate
this ordinary least squares estimator. However, since every iteration in such a
procedure involves numerical integration of the system (1) and since the number
of iterations is typically very large, in practice it is often extremely
difficult if not impossible to compute $\tilde\theta_n$, cf. p. 172 in Voit
(2000). Here we present a feasible and computationally much faster method to
estimate the parameter $\theta$. To define the estimator of $\theta$ we first
construct kernel estimators

$$\hat x_j(t) = \sum_{i=1}^{n} (t_i - t_{i-1}) \frac{1}{b} K\left( \frac{t - t_i}{b} \right) Y_{ij}$$

of $x_{\theta j}$ with $K$ a kernel function and $b = b_n$ a bandwidth. Now, the
estimator $\hat\theta_n$ of $\theta$ is defined as

$$\hat\theta_n = \operatorname{argmin}_{\eta} \int_0^1 \left\| \hat x'(t) - F(\hat x(t), \eta) \right\|^2 w(t) \, dt, \tag{3}$$

where $\|\cdot\|$ denotes the usual Euclidean norm and $w(\cdot)$ is a weight
function. Related approaches have been suggested in computational biology and
numerical analysis literature, see e.g. Bellman and Roth (1971), Voit and
Savageau (1982) and Varah (1982).

The main result of this paper is that this smooth and match estimator
$\hat\theta_n$ is $\sqrt{n}$-consistent under mild regularity conditions. So,
asymptotically the SME $\hat\theta_n$ is comparable to the ordinary least squares
estimator in statistical performance, but it avoids the computationally costly
repeated use of numerical integration of (1).

2. Introduction 

Let us introduce the contents of this paper in more detail. Systems of ordinary
differential equations play a fundamental role in many branches of natural
sciences, e.g. mathematical biology, see Edelstein-Keshet (2005), biochemistry,
see Voit (2000), or the theory of chemical reaction networks in general, see for
instance Feinberg (1979) and Sontag (2001). Such systems usually depend on
parameters, which in practice are often only approximately known, or are plainly
unknown. Knowledge of these parameters is critical for the study of the dynamical
system or process that the system of ordinary differential equations describes.
Since these parameters usually cannot be measured directly, they have to be
inferred from, as a rule, noisy measurements of various quantities associated
with the process under study. More formally, in this paper we consider the
following setting: let, as in (1),

$$x'(t) = F(x(t), \theta), \quad t \in [0, 1], \tag{4}$$

be a system of autonomous differential equations depending on a vector of
real-valued parameters. Here $x(t) = (x_1(t), \dots, x_d(t))^T$ is a
$d$-dimensional state variable, $\theta = (\theta_1, \dots, \theta_p)^T$ denotes
a $p$-dimensional parameter, while the column $d$-vector $x(0) = \xi$ defines the
initial condition. Whether the latter is known or unknown, is not relevant in the
present context, as long as it stays fixed. Denote a solution to (4)
corresponding to parameter value $\theta$ by
$x_\theta(t) = (x_{\theta 1}(t), \dots, x_{\theta d}(t))^T$. Suppose that at known
time instances $0 \le t_1 < \dots < t_n \le 1$ noisy observations

$$Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}, \quad i = 1, \dots, n, \quad j = 1, \dots, d, \tag{5}$$

on the solution $x_\theta$ are available. The random variables $\epsilon_{ij}$
model measurement errors, but they might also contain latent random deviations
from the idealized model (4). Such random deviations are often seen in real-world
applications. Based on these observations, the goal is to infer the value of
$\theta$, the parameter of interest. The standard approach to estimation of
$\theta$ is based on the least squares method (the least squares method is
credited to Gauss and Legendre, see Stigler (1981)), see e.g. Hemker (1972) and
Stortelder (1996). The least squares estimator is defined as a minimiser of the
sum of squares, i.e.

$$\tilde\theta_n = \operatorname{argmin}_{\eta \in \Theta} R_n(\eta) = \operatorname{argmin}_{\eta \in \Theta} \sum_{i=1}^{n} \sum_{j=1}^{d} \left( Y_{ij} - x_{\eta j}(t_i) \right)^2.$$

If the measurement errors are Gaussian, then $\tilde\theta_n$ coincides with the
maximum likelihood estimator and is asymptotically efficient. Since the
differential equations setting is covered by the general theory of nonlinear
least squares, theoretical results available for the latter apply also in the
differential equations setting and we refer e.g. to Jennrich (1969) and Wu (1981)
or more generally to van de Geer (1990), van de Geer and Wegkamp (1996), and
Pollard and Radchenko (2006) for a thorough treatment of the asymptotics of the
nonlinear least squares estimator. A paper that explicitly deals with the
ordinary differential equations setting is Xue et al. (2010). Despite its
appealing theoretical properties, in practice the performance of the least
squares method can dramatically degrade if (4) is a nonlinear high-dimensional
system and if $\theta$ is high-dimensional. In such a case we have to face a
nonlinear optimisation problem (quite often with many local minima) and search
for a global minimum of the least squares criterion function $R_n$ in a
high-dimensional parameter space. The search process is most often done via
gradient-based methods, e.g. the Levenberg-Marquardt method, see Marquardt
(1963), or via random search algorithms, see Section 4.5.2 in Voit (2000) for a
literature overview. Since nonlinear systems in general do not have solutions in
closed form, use of numerical integration within a gradient-based search method
and serious computational time associated with it seem to be inevitable. For
instance, a relatively simple example of a four-dimensional system considered in
Appendix 1 of Voit and Almeida (2004) demonstrates that the need to repeat
numerical integration multiple times might increase the computational time for
numerical integration up to 95% of the total computational time required for a
gradient-based optimisation method. Likewise, random search algorithms are also
very costly computationally and in general, computational time will typically be
a problem for any optimisation algorithm that relies on numerical integration of
any relatively realistic nonlinear system of ordinary differential equations, cf.
p. 172 in Voit (2000). One example is furnished by Kikuchi et al. (2003), where a
system that consists of five differential equations and contains sixty parameters
and that describes a simple gene regulatory network from Hlavacek and Savageau
(1996) is considered. The optimisation algorithm (a genetic algorithm) was run
for seven loops each lasting for about ten hours on the AIST CBRC Magi Cluster
with 1040 CPUs (Pentium III 933 MHz; see http://www.cbrc.jp/magi for the cluster
specifications). This amounted to a total of ca. 70,000 CPU hours. The authors
also remarked that the gradient-based search algorithm would not be feasible in
their setting at all. The problems become aggravated for systems of ordinary
differential equations that exhibit stiff behaviour, i.e. systems that contain
both 'slow' and 'fast' variables and that are difficult to integrate via explicit
numerical integration schemes, see e.g. Hairer and Wanner (1996) for a
comprehensive treatment of methods of solving numerically stiff systems. Even if
a system is not stiff for the true parameter value $\theta$, during the numerical
optimisation procedure one might pass the vicinity of parameters for which the
system is stiff, which will necessarily slow down the optimisation process.
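To make the cost argument concrete, the following is a minimal sketch of the ordinary least squares approach, not the setup of any of the cited papers. The logistic system, the noise level, the starting value and the box constraints are all assumptions chosen for illustration. The counter `n_solves` records how many times the system has to be integrated numerically during a single fit.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import least_squares

# Hypothetical example system (not from the paper): logistic growth
# x'(t) = theta_1 * x(t) * (1 - x(t) / theta_2), x(0) = 0.1.
def rhs(t, x, theta):
    return theta[0] * x * (1.0 - x / theta[1])

def solve_system(theta, t_obs, x0=0.1):
    # One numerical integration of the system for one parameter value.
    sol = solve_ivp(rhs, (0.0, 1.0), [x0], t_eval=t_obs, args=(theta,),
                    rtol=1e-8, atol=1e-10)
    return sol.y[0]

rng = np.random.default_rng(0)
n = 100
t_obs = np.arange(1, n + 1) / n                 # equidistant design t_j = j/n
theta_true = np.array([6.0, 1.0])
y = solve_system(theta_true, t_obs) + 0.02 * rng.standard_normal(n)

n_solves = 0

def residuals(eta):
    # Every residual evaluation triggers a fresh numerical integration.
    global n_solves
    n_solves += 1
    return y - solve_system(eta, t_obs)

# The bounds are an assumption that keeps the optimiser away from parameter
# values for which the solution may blow up before t = 1.
fit = least_squares(residuals, x0=np.array([3.0, 2.0]),
                    bounds=([0.1, 0.1], [20.0, 20.0]))
theta_ols = fit.x
print(theta_ols, n_solves)
```

Even in this two-parameter toy problem each optimiser iteration needs several integrations, because the Jacobian is approximated by finite differences. The number of integrations grows with the dimension of the parameter and with the number of iterations.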

The Bayesian approach to estimation of $\theta$, see e.g. Gelman et al. (1996)
and Girolami (2008), encounters similar huge computational problems. In the
Bayesian approach one puts a prior on the parameter $\theta$ and then obtains the
posterior via Bayes' formula. The posterior contains all the information required
in the Bayesian paradigm and can be used to compute e.g. point estimates of
$\theta$ or Bayesian credible intervals. If $\theta$ is high-dimensional, the
posterior will typically not be manageable by numerical integration and one will
have to resort to Markov Chain Monte Carlo (MCMC) methods. However, sampling from
the posterior distribution for $\theta$ via MCMC necessitates at each step
numerical integration of the system (4), in case the latter does not have a
closed form solution. Computational time might thus become a problem in this case
as well. Also, since in general the likelihood surface will have a complex shape
with many local optima, ripples, and ridges, see e.g. Girolami (2008) for an
example, serious convergence problems might arise for MCMC samplers.

Yet another point is that in practice both the least squares method and the 
Bayesian approach require good initial guesses of the parameter values. If these are 
not available, then both approaches might have problems with convergence to the 
true parameter value within a reasonable amount of time. 

Over the years a number of improvements upon the classical methods to compute the
least squares estimator have been proposed in the literature. In particular, the
multiple shooting method of Bock (1983) and the interior-point or barrier method
for large-scale nonlinear programming as in Wächter and Biegler (2006) have
proved to be quite successful. These two approaches tend to be much more stable
than classical gradient-based methods, have a better chance to converge even from
poor initial guesses of parameters, and in general require far fewer iterations
until convergence is achieved. However, they still require a nontrivial amount of
computational power.

A general overview of the typical difficulties in parameter estimation for
systems of ordinary differential equations is given in Ramsay et al. (2007), to
which we refer for more details. For a recent overview of typical approaches to
parameter estimation for systems of ordinary differential equations in
biochemistry and associated challenges see e.g. Chou and Voit (2009).

To evade difficulties associated with the least squares method, or more precisely
with numerical integration that it usually requires, a two-step method was
proposed in Bellman and Roth (1971) and Varah (1982). In the first step the
solution $x_\theta$ of (4) is estimated by considering estimation of the
individual components $x_{\theta 1}, \dots, x_{\theta d}$ as nonparametric
regression problems and by using the regression spline method for estimation of
these components. The derivatives of $x_{\theta 1}, \dots, x_{\theta d}$ are also
estimated from the data by differentiating the estimators of
$x_{\theta 1}, \dots, x_{\theta d}$ with respect to time $t$. Thus no numerical
integration of the system (4) is needed. In the second step the obtained estimate
of $x_\theta$ and its derivative $x'_\theta$ are plugged into (4) and an
estimator of $\theta$ is defined as a minimiser in $\theta$ of an appropriate
distance between the estimated left- and righthand sides of (4), as e.g. in (3).
Since this estimator of $\theta$ results from a minimisation procedure, it is an
M-estimator, see e.g. the classical monograph Huber (1981), or Chapter 7 of
Bickel et al. (1998), Chapter 5 of van der Vaart (1998), and Chapter 3.2 of
Wellner and van der Vaart (1996) for a more modern exposition of the theory of
M-estimators. For an approach to estimation of $\theta$ related to Bellman and
Roth (1971) and Varah (1982) see also Voit and Savageau (1982), as well as Voit
and Almeida (2004), where a practical implementation based on neural networks is
studied. The intuitive idea behind the use of this two-step estimator is clear:
among all functions defined on $[0, 1]$, any reasonably defined distance between
the left- and righthand side of (4) is minimal (namely, it is zero) for the
solution $x_\theta$ of (4) and the true parameter value $\theta$. For estimates
close enough in an appropriate sense to the solution $x_\theta$, the minimisation
procedure will produce a minimiser close to the true parameter value, provided
certain identifiability and continuity conditions hold. This intuitive idea was
exploited in Brunel (2008), where a more general setting than the one in Bellman
and Roth (1971) and Varah (1982) was considered. Another paper in the same spirit
as Bellman and Roth (1971) and Varah (1982) is Liang and Wu (2008).



This two-step approach will typically lead to considerable savings in
computational time, as unlike the straightforward least squares estimator, in its
first step it just requires finding nonparametric estimates of $x_\theta$ and
$x'_\theta$, for which fast and numerically reliable recipes are available,
whereas the gradient-based least squares method will still rely on successive
numerical integrations of (4) for different parameter values in order to find a
global minimiser of the least squares criterion function. We refer to Voit and
Almeida (2004) for a particular example demonstrating gains in the computational
time achieved by the two-step estimator in comparison to the ordinary least
squares estimator. When the righthand side $F$ of (4) is linear in
$\theta_1, \dots, \theta_p$ and $d = 1$, further simplifications will occur in
the second step of the two-step estimation procedure, as one will essentially
only have to face a weighted linear regression problem then. This is unlike the
least squares approach, which cannot exploit linearity of $F$ in
$\theta_1, \dots, \theta_p$. However, we would
also like to stress the fact that the two-step estimator does not necessarily have to 
be considered a competitor of either the least squares or the Bayesian approach. 
Indeed, since in practice both of these approaches require good initial guesses for 
parameter values, these can be supplied by the two-step estimator. In this sense 
the proposed two-step estimation approach can be thought of as complementing 
both the least squares and the Bayesian approaches. Moreover, an additional mod- 
ified Newton-Raphson step suffices to arrive at an estimator that is asymptotically 
equivalent to the exact ordinary least squares estimator, as will be shown elsewhere. 
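The simplification for a righthand side that is linear in the parameter can be sketched as follows. The sketch assumes $d = 1$ and $F(x, \eta) = G(x)\eta$ for a known row vector $G(x)$, so that the integral criterion is quadratic in $\eta$ and its minimiser solves a set of weighted normal equations. The function name `sme_linear_in_theta`, the logistic example and the use of the exact solution as a stand-in for smoothed curves are assumptions made for illustration only.

```python
import numpy as np
from scipy.integrate import trapezoid

def sme_linear_in_theta(t, x_hat, dx_hat, w, G):
    """Closed-form minimiser of int (dx_hat - G(x_hat) @ eta)^2 w dt for d = 1."""
    Gx = G(x_hat)                                    # shape (len(t), p)
    # Weighted Gram matrix and right-hand side, integrated by the trapezoid rule.
    A = trapezoid(Gx[:, :, None] * Gx[:, None, :] * w[:, None, None], t, axis=0)
    c = trapezoid(Gx * (dx_hat * w)[:, None], t, axis=0)
    return np.linalg.solve(A, c)                     # weighted normal equations

# Hypothetical example: x' = theta_1 * x - theta_2 * x**2 (logistic growth).
def G(x):
    return np.column_stack([x, -x**2])

theta = np.array([4.0, 4.0])
t = np.linspace(0.1, 0.9, 201)
x = 1.0 / (1.0 + 9.0 * np.exp(-4.0 * t))             # exact solution, x(0) = 0.1
dx = theta[0] * x - theta[1] * x**2                  # exact derivative
w = ((t - 0.1) * (0.9 - t)) ** 2                     # weight vanishing at the ends
theta_hat = sme_linear_in_theta(t, x, dx, w, G)
print(theta_hat)
```

With the exact curves plugged in, the parameter is recovered up to rounding error. With smoothed noisy curves in place of `x` and `dx`, the same linear solve applies and no iterative optimisation is needed.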

A certain limitation of the two-step approach is that it requires that
measurements on all state variables $x_{\theta j}$, $j = 1, \dots, d$, are
available. The latter is not always the case in practical applications. In some
cases the unobserved variables can be eliminated by transforming the first order
system into a higher order one and next applying a generalisation of our smooth
and match method to this higher order system. This approach should yield a
consistent estimator. One might also formally perform the least squares procedure
in such a case. However, without stringent assumptions on the system it is far
from clear that this leads to a consistent estimator.

Our goal in the present work is to undertake a rigorous study of the asymptotics
of a two-step estimator of $\theta$. Our exposition is similar to that in Brunel
(2008) to some degree, but one of the differences is that instead of regression
spline estimators we use kernel-type estimators for estimation of $x_\theta$ and
$x'_\theta$. The conditions are also different. We hope that our contribution
will motivate further research into the interesting topic of parameter estimation
for systems of ordinary differential equations.

There exists an alternative approach to the ones described here, which also
employs nonparametric smoothing, see Ramsay et al. (2007). For information on its
asymptotic properties we refer to Qi and Zhao (2010). For nonlinear systems this
approach will typically reduce to one of the realisations of the ordinary least
squares method, e.g. the Newton-Raphson algorithm, where however numerical
integration of (4) will be replaced by approximation of the solution of the
system (4) by an appropriately chosen element of some finite-dimensional function
space. This seems to reduce considerably the computational load in comparison to
the gradient-based optimisation methods which employ numerical integration of
(4). However, it still appears to be computationally more intense than the
two-step approach advocated in the present work.

We conclude the discussion in this section by noting that when modelling various
processes, some authors prefer not to specify the righthand side of (4)
explicitly (the latter amounts to explicit specification of the function
$F(\cdot, \cdot)$ in (4)), but simply assume that the righthand side of (4) is
some unknown function of $x$, i.e. is given by $F(x(t))$ with $F$ unknown, and
proceed to its estimation via nonparametric methods, see e.g. Ellner et al.
(2002). This has an advantage of safeguarding against possible model
misspecification. However, the question whether one has or has not to specify $F$
explicitly appears to us to be more of a philosophical nature and boils down to a
discussion on the use of parametric or nonparametric models, i.e. whether one has
strong enough reasons to believe that the process under study can be described by
a model as in (4) with $F$ known or not. We do not address this question here,
because an answer to it obviously depends on the process under study and varies
from case to case. For a related discussion see Hooker (2009).



The rest of the paper is organised as follows: in the next section we will detail
the approach that we use and present its theoretical properties. In particular,
we will show that under appropriate conditions our two-step approach leads to a
consistent estimator with a $\sqrt{n}$ convergence rate, which is the best
possible rate in regular parametric models. Section 4 contains a discussion on
the obtained results together with simulation examples. The proofs of the main
results are relegated to Section 5, while the Appendices contain some auxiliary
statements.

3. Results 

First of all, we point out that in the present study we will be concerned with
the asymptotic behaviour of an appropriate two-step estimator of $\theta$ under a
suitable sampling scheme. We will primarily be interested in intuitively
understanding the behaviour of a relatively simple estimator of $\theta$, as well
as in a clear presentation of the obtained results and the proofs. Consequently,
the stated conditions will not always be minimal and can typically be relaxed at
appropriate places.

Footnote. The proofs of the main results in Brunel (2008) are incomplete and the
main theorems require further conditions in order to hold.

Footnote. It is claimed in Liang and Wu (2008) that their two-step estimation
procedure leads to a faster rate than $\sqrt{n}$, which is impossible. Indeed,
their Theorem 2 and its proof are incorrect.
We first define the sampling scheme. 

Condition 1. The observation times $0 \le t_1 < \dots < t_n \le 1$ are
deterministic and known and there exists a constant $c_0 \ge 1$, such that for
all $n$

$$\max_{2 \le i \le n} |t_i - t_{i-1}| \le \frac{c_0}{n}$$

holds. Furthermore, there exists a constant $c_1 \ge 1$, such that for any
interval $A \subseteq [0, 1]$ of length $|A|$ and all $n \ge 1$ the inequality

$$\frac{1}{n} \sum_{i=1}^{n} 1_{[t_i \in A]} \le c_1 \max\left( |A|, \frac{1}{n} \right)$$

holds.

Hence, we observe the solution of the system (4) on the interval $[0, 1]$.
Instead of $[0, 1]$ we could have taken any other bounded interval. Conditions on
$t_1, \dots, t_n$, as in Condition 1, are typical in nonparametric regression,
see e.g. Gasser and Müller (1984) and Section 1.7 in Tsybakov (2009), and they
imply that $t_1, \dots, t_n$ are distributed over $[0, 1]$ in a sufficiently
uniform manner. The most important example in which Condition 1 is satisfied, is
when the observations are spaced equidistantly over $[0, 1]$, i.e. when
$t_j = j/n$ for $j = 1, \dots, n$. In this case one may take $c_0 = c_1 = 2$.
Notice that we do not necessarily assume that the initial condition $x(0) = \xi$
is measured or is known. If it is, then it is incorporated into the observations
and is used in the first step of the two-step estimation procedure.
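The two requirements of Condition 1 can be checked numerically for a given design. The sketch below does this for the equidistant design $t_j = j/n$; the helper names `gap_constant` and `worst_count_ratio` are hypothetical, and the second check only samples random intervals, so it gives a lower bound on the smallest admissible $c_1$ rather than a proof.

```python
import numpy as np

def gap_constant(t):
    """n * max_i (t_i - t_{i-1}): the smallest c0 that works for this n."""
    return len(t) * np.max(np.diff(t))

def worst_count_ratio(t, n_trials=5000, seed=0):
    """Largest sampled value of (1/n) #{t_i in A} / max(|A|, 1/n) over random A."""
    rng = np.random.default_rng(seed)
    n = len(t)
    worst = 0.0
    for _ in range(n_trials):
        lo, hi = np.sort(rng.uniform(0.0, 1.0, size=2))   # random interval [lo, hi]
        frac = np.mean((t >= lo) & (t <= hi))             # fraction of design points in A
        worst = max(worst, frac / max(hi - lo, 1.0 / n))
    return worst

n = 200
t_equi = np.arange(1, n + 1) / n
c0_needed = gap_constant(t_equi)
c1_sampled = worst_count_ratio(t_equi)
print(c0_needed, c1_sampled)
```

For the equidistant design both quantities stay below 2, in line with the choice $c_0 = c_1 = 2$ mentioned above.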

Condition 2. The random variables $\epsilon_{ij}$, $i = 1, \dots, n$,
$j = 1, \dots, d$, from (5) are independent and are normally distributed with
mean zero and finite variance $\sigma^2$.

This assumption of Gaussianity of the $\epsilon_{ij}$'s may be dropped in various
ways, as we will see below; see the note after Proposition 1 and Appendix B.
We next state a condition on the parameter set.

Condition 3. The parameter set $\Theta$ is a compact subset of $\mathbb{R}^p$.

Compactness of $\Theta$ allows one to put relatively weak conditions on the
structure of the system (4), i.e. the function $F$.

Just as the least squares method, see e.g. Jennrich (1969), our smooth and match
approach also requires some regularity of the solutions of (4). In what follows,
a derivative of any function $f$ with respect to the variable $y$ will be denoted
by $f'_y$. For the second derivative of $f$ with respect to $y$ we will use the
notation $f''_{yy}$ with a similar convention for mixed derivatives. An integral
of a vector- or matrix-valued function will be understood componentwise.

Condition 4. The following conditions hold:

(i) the mapping $F : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^d$ from (4)
is such that its second derivatives $F''_{\theta\theta}$, $F''_{\theta x}$,
$F''_{xx}$ are continuous;
(ii) for all parameter values $\theta \in \Theta$, the solution $x_\theta$ of (4)
is defined on the interval $[0, 1]$;
(iii) for all parameter values $\theta \in \Theta$, the solution $x_\theta$ of
(4) is unique on $[0, 1]$;
(iv) for all parameter values $\theta \in \Theta$, the solution $x_\theta$ of (4)
is a $C^\alpha$ function of $t$ on the interval $[0, 1]$ for some positive
integer $\alpha$.

Observe that Condition 4(i) implies existence and uniqueness of the solution of
(4) in some neighbourhood of 0. However, we want the existence and uniqueness to
hold on the whole interval $[0, 1]$ and therefore a priori require (ii) and
(iii). Furthermore, $\alpha \ge 2$ in (iv) is required when establishing
appropriate asymptotic properties of nonparametric estimators of the solution
$x_\theta$ and its derivative, while $\alpha \ge 3$ is needed in Propositions 3
and 4 and $\alpha \ge 4$ in Theorem 1, respectively. Notice that for every
$\theta$ the solution $x_\theta$ is of class $C^\alpha$ in $t$ in a neighbourhood
of 0, provided for a given $\theta$ the function $F$ is of class $C^\alpha$ in
its first argument. However, we want this to hold on the whole interval $[0, 1]$
and therefore require (iv). Since in the theory of chemical reaction networks,
see for instance Sontag (2001), the components of $F$ are usually polynomial or
rational functions of $x_1, \dots, x_d$ and $\theta_1, \dots, \theta_p$, the
solution of (4) will be smooth enough in many examples and $\alpha \ge 4$ is
satisfied in a large number of practical examples. For the above-mentioned facts
from the theory of ordinary differential equations see e.g. Chapter 2 in Arnold
(1973). Also notice that the condition on $F$ in Liang and Wu (2008), see
Assumption C on p. 1573, puts severe restrictions on $F$ and excludes e.g.
quadratic nonlinearities of $F$ in $x_1, \dots, x_d$. This, of course, has to be
avoided.



Recall that our observations are $Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}$ for
$i = 1, \dots, n$, $j = 1, \dots, d$. We propose the following nonparametric
estimator for $x_{\theta j}$,

$$\hat x_j(t) = \sum_{i=1}^{n} (t_i - t_{i-1}) \frac{1}{b} K\left( \frac{t - t_i}{b} \right) Y_{ij}, \tag{6}$$

where $K$ is a kernel function, while the number $b = b_n > 0$ denotes a
bandwidth that we take to depend on the sample size $n$ in such a way that
$b_n \to 0$ as $n \to \infty$. In line with a traditional convention in kernel
estimation theory, we will suppress the dependence of $b_n$ on $n$ in our
notation, since no confusion will arise. When the $t_i$'s are equispaced, the
estimator (6) can in essence be obtained by modifying the Nadaraya-Watson
regression estimator, cf. p. 34 in Tsybakov (2009). It is usually called the
Priestley-Chao estimator after the authors who first proposed it in Priestley and
Chao (1972). As far as an estimator of $x'_{\theta j}(t)$ is concerned, we define
it as the derivative of $\hat x_j(t)$ with respect to $t$, choosing $K$ as a
differentiable function. Notice that the bandwidth $b$ plays a role of
regularisation parameter: too small a bandwidth results in an estimator with
small bias, but large variance, while too large a bandwidth results in an
estimator with small variance, but large bias, see e.g. pp. 7-8 and 32 in
Tsybakov (2009) for a relevant discussion. In principle one could use different
bandwidth sequences for estimation of $x_{\theta j}$ for different $j$'s, but as
can be seen from the proofs in Section 5, asymptotically this will not make a
difference for an estimator of $\theta$. A similar remark applies to the use of
different bandwidths for estimation of $x_{\theta j}$ and its derivative
$x'_{\theta j}$. Arguably, the estimator (6) is simple and there exist other
estimators that may outperform it in certain respects in practice. However, as we
will show later on, even such a simple estimator leads to a $\sqrt{n}$-consistent
estimator of $\theta$.
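A minimal sketch of the estimator (6) and of its derivative in $t$ is given below for a single component. The triweight kernel, the convention $t_0 = 0$ for the first spacing, the test curve $e^{-2t}$, the noise level and the bandwidth are all assumptions made for the illustration; the text above does not prescribe any of them.

```python
import numpy as np

def triweight(u):
    # Triweight kernel: symmetric, supported on [-1, 1], twice continuously
    # differentiable. An illustrative choice of K.
    return np.where(np.abs(u) <= 1, 35 / 32 * (1 - u**2) ** 3, 0.0)

def triweight_deriv(u):
    # Derivative of the triweight kernel with respect to u.
    return np.where(np.abs(u) <= 1, -105 / 16 * u * (1 - u**2) ** 2, 0.0)

def priestley_chao(t_eval, t_obs, y, b, t0=0.0):
    """Return the estimate of the curve and of its derivative at t_eval."""
    gaps = np.diff(np.concatenate(([t0], t_obs)))     # t_i - t_{i-1}, t_0 assumed
    u = (t_eval[:, None] - t_obs[None, :]) / b
    x_hat = (triweight(u) * gaps * y).sum(axis=1) / b
    # Differentiating (1/b) K((t - t_i)/b) in t gives (1/b^2) K'((t - t_i)/b).
    dx_hat = (triweight_deriv(u) * gaps * y).sum(axis=1) / b**2
    return x_hat, dx_hat

rng = np.random.default_rng(0)
n, b = 2000, 0.1
t_obs = np.arange(1, n + 1) / n
y = np.exp(-2.0 * t_obs) + 0.02 * rng.standard_normal(n)
t_eval = np.linspace(0.2, 0.8, 61)                    # stay away from the boundary
x_hat, dx_hat = priestley_chao(t_eval, t_obs, y, b)
err_x = np.max(np.abs(x_hat - np.exp(-2.0 * t_eval)))
err_dx = np.max(np.abs(dx_hat + 2.0 * np.exp(-2.0 * t_eval)))
print(err_x, err_dx)
```

The derivative estimate is visibly noisier than the estimate of the curve itself, which reflects the slower rate for derivative estimation discussed in the sequel.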



Theoretical properties of the Priestley-Chao estimator were studied in Benedetti
(1977), Priestley and Chao (1972), and Schuster and Yakowitz (1979). However, the
first two papers do not cover its convergence in the $L_\infty$ (supremum) norm,
while the third one does not do it in the form required in the present work.
Since this is needed in the sequel, we will supply the required statement, see
Proposition 1 below.

To put things in a somewhat more general context than the one in our differential
equations setting, consider the following regression model:

$$\begin{aligned}
& Y_i = \mu(t_i) + \epsilon_i, \quad i = 1, \dots, n, \\
& t_1, \dots, t_n \text{ satisfy Condition 1}, \\
& \epsilon_1, \dots, \epsilon_n \text{ are i.i.d. Gaussian with } \mathrm{E}[\epsilon_i] = 0 \text{ and } \mathrm{Var}[\epsilon_i] = \sigma^2 > 0.
\end{aligned} \tag{7}$$

Our goal is to estimate the regression function $\mu$ and its derivative $\mu'$.
The estimator of $\mu$ will be given by an expression similar to (6), namely

$$\hat\mu_n(t) = \sum_{i=1}^{n} (t_i - t_{i-1}) \frac{1}{b} K\left( \frac{t - t_i}{b} \right) Y_i, \tag{8}$$

while an estimator of $\mu'$ will be given by $\hat\mu'_n$. We postulate the
following condition on the kernel $K$ for some strictly positive integer
$\alpha$.

Condition 5. The kernel $K$ is symmetric and twice continuously differentiable,
it has support within $[-1, 1]$, and it satisfies the integrability conditions:
$\int_{-1}^{1} K(u) \, du = 1$ and $\int_{-1}^{1} u^i K(u) \, du = 0$ for
$i = 1, \dots, \alpha - 1$. If $\alpha = 1$, only the first of the two
integrability conditions is required.
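For $\alpha \ge 3$ the moment conditions in Condition 5 force $\int u^2 K(u) \, du = 0$, which a nonnegative kernel cannot satisfy. One way to obtain such a kernel, sketched below under the assumption that a polynomial multiple of the triweight shape is acceptable, is to solve the moment equations for $K(u) = (c_0 + c_2 u^2)(1 - u^2)^3$. The names `base`, `moment` and `K4` are hypothetical; the construction is an illustration, not the kernel used in the paper.

```python
import numpy as np
from scipy.integrate import quad

def base(u):
    # Triweight shape; the factor (1 - u^2)^3 keeps the kernel twice
    # continuously differentiable at the endpoints -1 and 1.
    return (1.0 - u**2) ** 3

def moment(f, k):
    # k-th moment of f over [-1, 1], computed by numerical quadrature.
    return quad(lambda u: float(u**k * f(u)), -1.0, 1.0)[0]

# Solve for K(u) = (c0 + c2 * u^2) * base(u) with int K = 1 and int u^2 K = 0.
m0, m2, m4 = (moment(base, k) for k in (0, 2, 4))
c0, c2 = np.linalg.solve([[m0, m2], [m2, m4]], [1.0, 0.0])

def K4(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1, (c0 + c2 * u**2) * (1 - u**2) ** 3, 0.0)

# Moments of order 0, 1, 2, 3: the integrability conditions for alpha = 4.
moments = [moment(K4, k) for k in range(4)]
print(np.round(moments, 12))
```

The odd moments vanish by symmetry, so only the normalisation and the second moment need to be enforced. The resulting kernel takes negative values near the ends of its support, as any kernel with vanishing second moment must.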

The following proposition holds. 

Proposition 1. Suppose the regression model (7) is given and Condition 5 holds.
Fix a number $\delta$, such that $0 < \delta < 1/2$.

(i) If $\mu$ is $\alpha \ge 1$ times continuously differentiable and $b \to 0$ as
$n \to \infty$, then

$$\sup_{t \in [\delta, 1 - \delta]} |\hat\mu_n(t) - \mu(t)| = O_P\left( b^{\alpha} + \frac{1}{n b^2} + \sqrt{\frac{\log n}{n b}} \right). \tag{9}$$

(ii) If $\mu$ is $\alpha \ge 2$ times continuously differentiable and $b \to 0$
as $n \to \infty$, then

$$\sup_{t \in [\delta, 1 - \delta]} |\hat\mu'_n(t) - \mu'(t)| = O_P\left( b^{\alpha - 1} + \frac{1}{n b^3} + \sqrt{\frac{\log n}{n b^3}} \right) \tag{10}$$

is valid. In particular, $\hat\mu_n$ and $\hat\mu'_n$ are consistent on
$[\delta, 1 - \delta]$, if $n b^3 / \log n \to \infty$ holds additionally.

Gaussianity of the $\epsilon_i$'s allows one to prove (9) and (10) by relatively
elementary means. This assumption can be modified in various ways, for instance
by assuming that the $\epsilon_i$'s are bounded, and we state and prove the
corresponding modification of Proposition 1 in Appendix B, see Proposition 5. In
general, normality of the measurement errors is a standard assumption in
parameter estimation for systems of ordinary differential equations, see e.g.
Girolami (2008), Hemker (1972), and Ramsay et al. (2007).

The following corollary is immediate from Proposition 1.

Corollary 1. Let $\alpha$ be the same as in Condition 4. Under Conditions 1-5 we
have for the estimator $\hat x_j$

$$\sup_{t \in [\delta, 1 - \delta]} |\hat x_j(t) - x_{\theta j}(t)| = O_P\left( b^{\alpha} + \frac{1}{n b^2} + \sqrt{\frac{\log n}{n b}} \right) \tag{11}$$

and

$$\sup_{t \in [\delta, 1 - \delta]} |\hat x'_j(t) - x'_{\theta j}(t)| = O_P\left( b^{\alpha - 1} + \frac{1}{n b^3} + \sqrt{\frac{\log n}{n b^3}} \right), \tag{12}$$

provided $\alpha \ge 2$ and $b \to 0$ as $n \to \infty$. In particular,
$\hat x_j$ and $\hat x'_j$ are consistent, if $n b^3 / \log n \to \infty$ holds
additionally.

In the proof of Proposition 5 we will apply the continuous mapping theorem in
order to prove convergence in probability of certain integrals of $F$ and its
derivatives with $\hat x_j$'s plugged in. This is where Corollary 1 is used.

Now that we have consistent (in an appropriate sense) estimators of
$x_{\theta j}$ and $x'_{\theta j}$ from the smoothing step, we can move to the
matching step in the construction of our smooth and match estimator of $\theta$.
In particular, we define the estimator $\hat\theta_n$ of $\theta$ as

$$\hat\theta_n = \operatorname{argmin}_{\eta \in \Theta} \int_0^1 \left\| \hat x'(t) - F(\hat x(t), \eta) \right\|^2 w(t) \, dt = \operatorname{argmin}_{\eta \in \Theta} M_{n,w}(\eta), \tag{13}$$

where $\|\cdot\|$ denotes the usual Euclidean norm and $w$ is a weight function.
We will refer to $M_{n,w}(\eta)$ as a (random) criterion function. Since $\Theta$
is compact and $M_{n,w}$ under our conditions is continuous in $\eta$, the
minimiser $\hat\theta_n$ always exists. The fact that $\hat\theta_n$ is a
measurable function of the observations $Y_{ij}$ follows from Lemma 2 of Jennrich
(1969). Notice that in Liang and Wu (2008) and Varah (1982) the criterion
function is given by

$$\sum_{i=1}^{n} \left\| \hat x'(t_i) - F(\hat x(t_i), \eta) \right\|^2,$$

where $\hat x$ and $\hat x'$ are appropriate estimators of $x_\theta$ and
$x'_\theta$. However, in order to obtain a $\sqrt{n}$-consistent estimator of
$\theta$, it is important to use an integral type criterion: the nonparametric
estimators of $x_\theta$ and $x'_\theta$ have a slower convergence rate than
$\sqrt{n}$ and this is counterbalanced by the integral criterion from (13).
Indeed, stationarity at $\hat\theta_n$ leads to (57). The first factor at the
left hand side of this equality converges to a constant nondegenerate matrix and
the righthand side behaves like a linear combination of the observations with
coefficients of order $1/n$ thanks to the integration; cf. Proposition 4 and its
proof. In light of this the choice of the weight function $w$ also appears to be
important. Furthermore, the observations $Y_{ij}$ from (5) indirectly carry
information on the entire curves $x_{\theta j}(t)$, $t \in [0, 1]$, and not only
on the points $x_{\theta j}(t_i)$. An integral type criterion allows one to
exploit this fact in the second step of this smooth and match procedure.
Introduce the asymptotic criterion

\[ M_w(\eta) = \int_0^1 \|F(x_\theta(t),\theta) - F(x_\theta(t),\eta)\|^2 w(t)\,dt \]

corresponding to $M_{n,w}$. Observe that by Condition 3 it is bounded. Using Corollary 1 as a building block, one can show that the SME $\hat\theta_n$ is consistent. To this end we will need the following condition on the weight function $w$.

Condition 6. The weight function $w$ is a nonnegative function that is continuously differentiable, is supported on the interval $(\delta, 1-\delta)$ for some fixed number $\delta$ such that $0 < \delta < 1/2$, and is such that the Lebesgue measure of the set $\{t : w(t) > 0\}$ is positive.

The fact that $w$ vanishes at the endpoints of the interval $[\delta, 1-\delta]$ and beyond is needed to obtain a $\sqrt{n}$-consistent estimator of $\theta$. In particular, together with differentiability of $w$ it is used in order to establish (40). The condition that $w$ is supported on $(\delta, 1-\delta)$ takes care of the boundary bias effects characteristic of the conventional kernel-type estimators, see e.g. Gasser and Müller (1984) for more information on this. Boundary effects in kernel estimation are usually remedied by using special boundary kernels, see e.g. van Es (1991), Gasser et al. (1985), Messer and Goldstein (1993). Using such a kernel, it can be expected that in our case as well the boundary effects will be eliminated and one may relax the requirement $0 < \delta < 1/2$ from Condition 6 to $\delta = 0$, i.e. to allowing $w$ to be supported on $(0,1)$. The condition that the weight function $w$ is positive on a set with positive Lebesgue measure is important for (14) to hold and in fact $w(t) = 0$ a.e. would be a strange choice.

The following proposition is valid. 

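Proposition 2. Suppose $b \to 0$ and $nb^3/\log n \to \infty$. Under Conditions 1-6 and the additional identifiability condition

\[ \forall \varepsilon > 0, \quad \inf_{\|\eta-\theta\|\ge\varepsilon} M_w(\eta) > M_w(\theta), \tag{14} \]

the estimator $\hat\theta_n$ is consistent, i.e. $\hat\theta_n \to \theta$ in probability.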



The proposition is proved via a reasoning standard in the theory of M-estimation: we show that $M_{n,w}$ converges to $M_w$ and that the convergence is strong enough to imply the convergence of a minimiser $\hat\theta_n$ of $M_{n,w}$ to a minimiser $\theta$ of $M_w$, cf. Section 5.2 of van der Vaart (1998). A necessary condition for (14) to hold is that $x_\theta(\cdot) \ne x_{\theta'}(\cdot)$ for $\theta \ne \theta'$. The latter is a minimal assumption for the statistical identifiability of the parameter $\theta$. The identifiability condition (14) is common in the theory of M-estimation, see Theorem 5.7 of van der Vaart (1998). It means that $\theta$ is a point of minimum of $M_w(\eta)$ and that it is a well-separated point of minimum. The most trivial example with this condition satisfied is when $d = p = 1$ and $x'(t) = \theta x(t)$ hold with initial condition $x(0) = \xi$, where $\xi \ne 0$. In fact, in this case

\[ M_w(\eta) = (\theta-\eta)^2 \xi^2 \int_\delta^{1-\delta} e^{2\theta t} w(t)\,dt, \]

and this is zero for $\eta = \theta$ and is strictly positive for $\eta \ne \theta$, whence (14) follows. More generally, since $\Theta$ is compact and $M_w$ is continuous, uniqueness of a minimiser of $M_w$ will imply (14), cf. Exercise 27 on p. 84 of van der Vaart (1998).
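The closed form of the asymptotic criterion in the linear example can be checked in two lines. The following worked computation is only an illustration; it uses nothing beyond the explicit solution $x_\theta(t) = \xi e^{\theta t}$ of $x' = \theta x$, $x(0) = \xi$, and the fact that $w$ is supported on $(\delta, 1-\delta)$.

```latex
\begin{align*}
F(x,\eta) &= \eta x, \qquad x_\theta(t) = \xi e^{\theta t}, \\
M_w(\eta) &= \int_\delta^{1-\delta} \bigl(\theta\xi e^{\theta t} - \eta\xi e^{\theta t}\bigr)^2 w(t)\,dt
           = (\theta-\eta)^2\,\xi^2 \int_\delta^{1-\delta} e^{2\theta t} w(t)\,dt, \\
\inf_{|\eta-\theta|\ge\varepsilon} M_w(\eta)
          &\ge \varepsilon^2\,\xi^2 \int_\delta^{1-\delta} e^{2\theta t} w(t)\,dt \;>\; 0 = M_w(\theta).
\end{align*}
```

The last integral is strictly positive because $e^{2\theta t} > 0$ and, by Condition 6, $w > 0$ on a set of positive Lebesgue measure; this is exactly the well-separation required in the identifiability condition.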



In practice (14) might be difficult to check globally and one might prefer to concentrate on a simpler local condition: if the first order condition $[dM_w(\eta)/d\eta]_{\eta=\theta} = 0$ holds and if the Hessian matrix $H(\eta) = (\partial^2 M_w(\eta)/\partial\eta_i\partial\eta_j)_{i,j}$ of $M_w$ is strictly positive definite at $\theta$, then (14) will be satisfied for $\eta \in \Theta$ restricted to some neighbourhood of $\theta$, because $M_w$ will have a local minimum at such $\theta$ and a neighbourhood around it can be taken to be compact with small enough diameter, so that (14) holds for $\eta$ restricted to this neighbourhood. The conclusion of the theorem will then hold for the parameter set restricted to this neighbourhood of $\theta$.

In a statement analogous to Proposition 2, Brunel (2008) requires that the solutions of (1) belong to a compact set $\mathcal{X}$ for all $\theta$ and $t$ and that $F$ from (1) is Lipschitz in its first argument $x$ for $x$ restricted to this compact $\mathcal{X}$ uniformly in $\theta \in \Theta$. It is also assumed that the nonparametric estimators $\hat{x}_n(t)$ belong a.s. to $\mathcal{X}$ for all $n$ and $t$. However, the latter typically will not hold for linear smoothers, see Definition 1.7 in Tsybakov (2009), which constitute the most popular choice of nonparametric regression estimators in practice. For instance, local polynomial estimators, see Section 1.6 in Tsybakov (2009), projection estimators, see Section 1.7 in Tsybakov (2009), or the Gasser-Müller estimator, see Gasser and Müller (1984), are all examples of linear smoothers. Hence we prefer to avoid this condition altogether, although this somewhat complicates the proof.

Under the conditions in this section it turns out that the estimator $\hat\theta_n$ is not merely a consistent estimator, but a $\sqrt{n}$-consistent estimator of $\theta$, in the sense of (18) below. This result follows in essence from the fact that up to a higher order term the difference $\hat\theta_n - \theta$ can be represented as the difference of the images of $\hat{x}$ and $x_\theta$ under a certain linear mapping, cf. Proposition 3. It is known that even though nonparametric curve estimators cannot usually attain the $\sqrt{n}$ convergence rate, see e.g. Chapters 1 and 2 of Tsybakov (2009), extra smoothness often coming from the structure of linear functionals allows one to construct in many cases $\sqrt{n}$-consistent estimators of these functionals via plugging in nonparametric estimators, see e.g. Bickel and Ritov (2003) and Goldstein and Messer (1992) for more information. The variance of such plug-in estimators can often be proven to be of order $n^{-1}$, while the squared bias can be made of order $n^{-1}$ by undersmoothing, i.e. selecting the smoothing parameter smaller than what is an optimal choice in nonparametric curve estimation when the object of interest is a curve itself, cf. Goldstein and Messer (1992). Precisely this happens in our case as well: if the mean integrated squared error is used as a performance criterion of a nonparametric estimator, then under our conditions the optimal bandwidth for estimation of $x_\theta$ is of order $n^{-1/(2\alpha+1)}$, whereas the optimal bandwidth for estimation of $\theta$ is in fact smaller, see Theorem 1 below. Note that undersmoothing is a different approach than the one in Bickel and Ritov (2003), where it is assumed that nonparametric estimators attain the minimax rate of convergence and the $\sqrt{n}$-rate for estimation of a functional in concrete examples, if possible, is achieved by different means exploiting extra smoothness coming from the structure of a functional, see e.g. the first example in Section 2 there. In many cases it can be proved that such plug-in type estimators are efficient, see Bickel and Ritov (2003). Notice, however, that in our case this will not imply that $\hat\theta_n$ is efficient.

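First we will provide an asymptotic representation for the difference $\hat\theta_n - \theta$.

Proposition 3. Let $\theta$ be an interior point of $\Theta$. Suppose that the conditions of Proposition 2 hold and let the matrix $J_\theta$ defined by

\[ J_\theta = \int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T F'_\theta(x_\theta(t),\theta)\, w(t)\,dt \tag{15} \]

be nonsingular. Fix $\alpha \ge 3$. If $b \asymp n^{-\gamma}$ holds for $1/(4\alpha-4) < \gamma < 1/6$, then

\[ \hat\theta_n - \theta = J_\theta^{-1}\bigl(\Gamma(\hat{x}) - \Gamma(x_\theta)\bigr) + o_P(n^{-1/2}) \tag{16} \]

is valid with the mapping $\Gamma$ given by

\[ \Gamma(z) = -\int_\delta^{1-\delta} \Bigl\{ (F'_\theta(x_\theta(t),\theta))^T F'_x(x_\theta(t),\theta)\, w(t) + \frac{d}{dt}\bigl[(F'_\theta(x_\theta(t),\theta))^T w(t)\bigr] \Bigr\} z(t)\,dt. \tag{17} \]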
With the above result in mind, in order to complete the study of the asymptotics of $\hat\theta_n$, it remains to study the mapping $\Gamma$. Clearly, it suffices to study the asymptotic behaviour of

\[ \Lambda(\hat\mu_n) - \Lambda(\mu) = \int_\delta^{1-\delta} v(t)k(t)\hat\mu_n(t)\,dt - \int_\delta^{1-\delta} v(t)k(t)\mu(t)\,dt, \]

where $v$ is a known function that satisfies appropriate assumptions, while $k$ stands either for $w$ or its derivative $w'$. The next proposition deals with the asymptotics of $\Lambda(\hat\mu_n) - \Lambda(\mu)$.

Proposition 4. Under Conditions\^ and\^ and for any continuous function v it 
holds in the regression model (7) that

\[ \Lambda(\hat\mu_n) - \Lambda(\mu) = O_P(n^{-1/2}), \]

provided $\mu$ is $\alpha \ge 3$ times differentiable and the bandwidth $b$ is chosen such that $b \asymp n^{-\gamma}$ holds for $1/(2\alpha) < \gamma < 1/4$.

Our main result is a simple consequence of Propositions 3 and 4.

Theorem 1. Let $\theta$ be an interior point of $\Theta$. Assume that Conditions 1-6 together with (14) hold and that (15) is nonsingular. Fix $\alpha \ge 4$. If the bandwidth $b$ is such that $b \asymp n^{-\gamma}$ holds for $1/(2\alpha) < \gamma < 1/6$, then

\[ \sqrt{n}\,(\hat\theta_n - \theta) = O_P(1) \tag{18} \]

is valid.

Thus any bandwidth sequences satisfying the conditions in Theorem 1 are optimal, in the sense that they lead to estimators of $\theta$ with similar asymptotic behaviour. In particular, each of such bandwidth sequences ensures a $\sqrt{n}$ convergence rate of $\hat\theta_n$. Consequently, dependence of the asymptotic properties of the estimator $\hat\theta_n$ on the bandwidth is less critical than it typically is in nonparametric curve estimation. Notice that the condition $\alpha \ge 4$ in Theorem 1 is needed in order to make the conditions in Propositions 3 and 4 compatible.
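The compatibility remark can be made explicit by intersecting the two bandwidth ranges. The short computation below is a sketch that assumes $\alpha$ is an integer (the order of the kernel); the numbers for $\alpha = 4$ are given only as an illustration.

```latex
\begin{align*}
\text{Proposition 3:}\quad & \tfrac{1}{4\alpha-4} < \gamma < \tfrac{1}{6}, \\
\text{Proposition 4:}\quad & \tfrac{1}{2\alpha} < \gamma < \tfrac{1}{4}, \\
\text{both together:}\quad & \max\Bigl\{\tfrac{1}{4\alpha-4},\,\tfrac{1}{2\alpha}\Bigr\} = \tfrac{1}{2\alpha} < \gamma < \tfrac{1}{6}
   \qquad (\text{since } 4\alpha-4 \ge 2\alpha \text{ for } \alpha \ge 2).
\end{align*}
```

The joint range is nonempty exactly when $1/(2\alpha) < 1/6$, i.e. $\alpha > 3$, which for integer $\alpha$ means $\alpha \ge 4$. For $\alpha = 4$ it reads $1/8 < \gamma < 1/6$, whereas a bandwidth of order $n^{-1/(2\alpha+1)} = n^{-1/9}$ corresponds to $\gamma = 1/9 < 1/8$; the admissible bandwidths are therefore of smaller order, in line with the undersmoothing discussed above.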

4. Discussion 

The main result of the paper, Theorem 1, is that under certain conditions for systems of ordinary differential equations parameter estimation at the $\sqrt{n}$ rate is possible without employing numerical integration. Although we have shown this in the case when in the first step of the two-step procedure a particular kernel-type estimator is used, it may be expected that a similar result holds for other nonparametric estimators. For instance, the arguments for the Nadaraya-Watson estimator seem to be similar, with extra technicalities arising e.g. from the fact that it is a ratio of two functions. Furthermore, from formula (40) it can be seen that the proof of Proposition 3 requires that the derivative of an estimator of $x_\theta$ be used as an estimator of $x'_\theta$. Not all popular nonparametric estimators of the derivatives of a regression function are of this type. In practice for small or moderate sample sizes it might be advantageous to use more sophisticated nonparametric estimators than the Priestley-Chao estimator, but asymptotically this does not make a difference.

Once a $\sqrt{n}$-consistent estimator $\hat\theta_n$ of $\theta$ is available, one might ask for more, namely if one can construct an estimator that is asymptotically equivalent to the ordinary least squares estimator (2) or that is semiparametrically efficient. It is expected that this can be achieved without repeated numerical integration of (1) by using $\hat\theta_n$ as a starting point and performing a one-step Newton-Raphson type procedure; see e.g. Section 7.8 of Bickel et al. (1998) or Chapter 25 of van der Vaart (1998). We intend to address this issue of efficient and ordinary least squares estimation in a separate publication.

Doubtless, the main challenge in implementing the smooth and match estimation procedure lies in selecting the smoothing parameter $b$. This is true for any two-step parameter estimation procedure for ordinary differential equations, e.g. the one based on the regression splines as in Brunel (2008) or the local polynomial estimator as in Liang and Wu (2008), and not only for our specific estimator. Observations that we supply below apply in principle to any two-step estimator and not only to the specific kernel-type one considered in the present work. Hence they are of general interest.

Some attention has been paid in the literature to the selection of the smoothing parameter in the context of parameter estimation for ordinary differential equations. The considered options range from subjective choices and smoothing by hand to more advanced possibilities. Perhaps the simplest solution would be to assume that the targets of the estimation procedure are $x_{\theta j}$, $j = 1,\dots,d$, and to select $b$ (a different one for every component $x_{\theta j}$) via a cross-validation procedure, see e.g. Section 5.3 in Wasserman (2006) for a description of cross-validation techniques in the context of nonparametric regression. This should produce reasonable results, at least for relatively large sample sizes, cf. simulation examples considered in Brunel (2008). However, it is clear from Theorem 1 and its proof that despite its simplicity, such a choice of $b$ will be suboptimal. Another practical approach to bandwidth selection is computation of $\hat\theta_n = \hat\theta_n(b)$ for a range of values of the bandwidth $b$ on some discrete grid $B$ and then choosing

\[ \hat{b} = \operatorname{argmin}_{b\in B} \sum_{i=1}^n \sum_{j=1}^d \bigl( Y_{ij} - x_{\hat\theta_n(b)\,j}(t_i) \bigr)^2. \]



This seems a reasonable choice, although the asymptotics of $\hat\theta_n(\hat{b})$ are unclear. One other possibility for practical bandwidth selection is nothing else but a variation on the plug-in bandwidth selection method as described e.g. in Jones et al. (1996): one can see from the proof in Section 5 that the terms that depend on the bandwidth $b$ are lower order terms in the expansion of $\hat\theta_n - \theta$. One can then minimise with respect to $b$ a bound on these lower order terms. A minimiser, say $b^*$, will depend on the unknown true parameter $\theta$, also via $x_\theta$ and $x'_\theta$, as well as on the error variances $\sigma_1^2,\dots,\sigma_d^2$. However, $\theta$, $x_\theta$, and $x'_\theta$ can be re-estimated via $\hat\theta_n$, $\hat{x}$, and $\hat{x}'$ using a different, pilot bandwidth $\tilde{b}$. Of course, instead of $\hat{x}$ and $\hat{x}'$ the use of any other nonparametric estimators of a regression function and its derivative, e.g. local polynomial estimators, see Section 1.6 of Tsybakov (2009), or the Gasser-Müller estimator, see Gasser and Müller (1984), is also a valid option. Error term variances can be estimated via one of the methods described in Hall and Marron (1990) or Section 5.6 of Wasserman (2006). Once the pilot estimators of $\theta$, $x_\theta$, and $x'_\theta$ together with estimators of $\sigma_1^2,\dots,\sigma_d^2$ are available, these can be plugged back into $b^*$ and in this way one obtains a bandwidth $\hat{b}$ that estimates the optimal bandwidth $b^*$. The final step would be computation of $\hat\theta_n$ with the new bandwidth $\hat{b}$. Unfortunately, this method leads to extremely cumbersome expressions and furthermore, since we are minimising an upper bound on numerous remainder terms, it will probably tend to oversmooth, i.e. produce a bandwidth $\hat{b}$ larger than required. Moreover, the plug-in approach in general is subject to some controversy, having both supporters and critics, see e.g. Loader (1999) and references therein. An alternative to the plug-in approach might be an approach based on one of the resampling methods: cross-validation, jackknife, or bootstrap. Computationally these resampling methods will be quite intensive. Theoretical analysis of the properties of such bandwidth selectors is a rather nontrivial task. Also a thorough simulation study is needed before the practical value of different bandwidth selection methods can be assessed. We do not address these issues here.

The next observation of this section concerns numerical computation of our SME. The kernel-type nonparametric regression estimates of $x_{\theta j}$, $j = 1,\dots,d$, can be quickly evaluated on any regular grid of points $0 < s_1 < \dots < s_m$, e.g. via techniques using the Fast Fourier Transform (FFT) similar to those described in Appendix D of Wand and Jones (1995). See also Fan and Marron (1994). Furthermore, in the match step of the two-step estimation procedure the criterion function $M_{n,w}$ can be approximated by a finite sum by discretising the integral in its definition. If $F$ is linear in $\theta_1,\dots,\theta_p$ and is univariate, then as already observed in Varah (1982), see pp. 29 and 31, cf. p. 1262 in Brunel (2008) and p. 1573 in Liang and Wu (2008), this will lead to a weighted linear least squares problem, which can be solved in a routine fashion without using e.g. random search methods. This is a great simplification in comparison to the ordinary least squares estimator, which moreover will still tend to get trapped in local minima of the least squares criterion function despite the fact that $F$ is linear in its parameters.
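The reduction to weighted linear least squares can be sketched in a few lines of Python (not the Mathematica implementation used for the examples in this section). The sketch assumes that $F(x,\eta) = G(x)\eta$ for a known $d \times p$ matrix function $G$, so that the discretised criterion $\sum_k w(s_k)\|\hat{x}'(s_k) - G(\hat{x}(s_k))\eta\|^2$ is quadratic in $\eta$. The function name `match_step_linear`, the array layout and the toy model $x' = \theta_1 x - \theta_2 x^2$ are illustrative assumptions, not part of the paper.

```python
import numpy as np


def match_step_linear(design, dxhat, weights):
    """Minimise sum_k w_k * ||dxhat[k] - design[k] @ eta||^2 over eta.

    design  : array (m, d, p), design[k] = G(xhat(s_k))   (hypothetical layout)
    dxhat   : array (m, d), smoothed derivative on the grid
    weights : array (m,), values w(s_k) >= 0
    """
    design = np.asarray(design, dtype=float)
    dxhat = np.asarray(dxhat, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    m, d, p = design.shape
    # Scale every row by sqrt(w_k) and stack the d equations of all grid points.
    A = (design * sw[:, None, None]).reshape(m * d, p)
    y = (dxhat * sw[:, None]).reshape(m * d)
    eta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return eta


# Toy check with noise-free inputs that satisfy the model exactly.
s = np.linspace(0.0, 1.0, 50)
x = 1.0 + s                                   # stand-in for a smoothed curve
true_eta = np.array([1.5, 0.7])
dx = true_eta[0] * x - true_eta[1] * x**2     # "derivative" consistent with the model
design = np.stack([x, -x**2], axis=1)[:, None, :]   # shape (50, 1, 2)
weights = s * (1.0 - s)                       # vanishes at both end points
eta_hat = match_step_linear(design, dx[:, None], weights)
```

With exact inputs the least squares solution reproduces `true_eta`; with smoothed, noisy inputs the same call would return the discretised smooth and match estimate.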

We conclude this section with two simple problems illustrating parameter estimation for systems of ordinary differential equations via the smooth and match method studied in the present paper. Our first example deals with the Lotka-Volterra system that is a basic model in population dynamics. It describes evolution over time of the populations of two species, predators and their prey. In mathematical terms the Lotka-Volterra model is described by a system consisting of two ordinary differential equations and depending on the parameter $\theta = (\theta_1,\theta_2,\theta_3,\theta_4)^T$,

\[ \begin{cases} x_1'(t) = \theta_1 x_1(t) - \theta_2 x_1(t)x_2(t), \\ x_2'(t) = -\theta_3 x_2(t) + \theta_4 x_1(t)x_2(t). \end{cases} \tag{19} \]

Here $x_1$ represents the prey population and $x_2$ the predator population. For additional information on the Lotka-Volterra system see e.g. Section 6.2 in Edelstein-Keshet (2005). We took $\theta_k = 0.5$, $k = 1,\dots,4$, and the initial condition $(x_1(0),x_2(0)) = (1,0.5)$. The solution to (19) corresponding to these parameter values is plotted in Figure 1 with a thin line. The left panel represents $x_{\theta 1}$, the right panel $x_{\theta 2}$. The solution components $x_{\theta 1}$ and $x_{\theta 2}$ are of oscillatory nature and are out of phase with each other. Next we simulated a small data set of size $n = 50$ of observations on the solution $x_\theta$ of (19) over the time interval $[0,25]$ by taking an equidistant grid of time points $t_i = 0.5i$ for $i = 1,\dots,50$ and setting

\[ Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}, \quad i = 1,\dots,50,\ j = 1,2, \tag{20} \]

where the i.i.d. measurement errors $\epsilon_{ij}$ were generated from the normal distribution $N(0,\sigma^2)$ with mean zero and variance $\sigma^2 = 0.01$. These observations $Y_{ij}$ are represented by crosses in Figure 1.
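A data set of this kind can be simulated along the following lines. This is a sketch only: the classical fourth order Runge-Kutta scheme, its step size 0.01 and the random seed are assumptions made here for illustration, and the resulting noise realisation will differ from the one behind Figure 1.

```python
import numpy as np


def lotka_volterra_rhs(x, theta):
    """Right-hand side of the Lotka-Volterra system (19)."""
    x1, x2 = x
    t1, t2, t3, t4 = theta
    return np.array([t1 * x1 - t2 * x1 * x2, -t3 * x2 + t4 * x1 * x2])


def rk4_solve(rhs, x0, theta, t_end, n_steps):
    """Classical RK4 on [0, t_end]; returns an array of shape (n_steps + 1, d)."""
    h = t_end / n_steps
    xs = np.empty((n_steps + 1, len(x0)))
    xs[0] = x0
    for k in range(n_steps):
        x = xs[k]
        k1 = rhs(x, theta)
        k2 = rhs(x + 0.5 * h * k1, theta)
        k3 = rhs(x + 0.5 * h * k2, theta)
        k4 = rhs(x + h * k3, theta)
        xs[k + 1] = x + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return xs


theta = np.array([0.5, 0.5, 0.5, 0.5])
x0 = np.array([1.0, 0.5])
steps_per_obs = 50                                   # RK4 step 0.5 / 50 = 0.01 (arbitrary)
path = rk4_solve(lotka_volterra_rhs, x0, theta, 25.0, 50 * steps_per_obs)

t_obs = 0.5 * np.arange(1, 51)                       # t_i = 0.5 i, i = 1, ..., 50
x_obs = path[steps_per_obs::steps_per_obs]           # solution at the observation times
rng = np.random.default_rng(0)                       # seed chosen arbitrarily
Y = x_obs + rng.normal(0.0, 0.1, size=x_obs.shape)   # sigma^2 = 0.01, i.e. sigma = 0.1
```

Numerical integration is needed here only to generate artificial data; the smooth and match estimator itself never calls the solver.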
Figure 1. Solution of the Lotka-Volterra system (19) (thin line) with parameter values $\theta_k = 0.5$, $k = 1,\dots,4$, and initial condition $(x_1(0),x_2(0)) = (1,0.5)$, observations $Y_{ij}$ given by (20) with $\epsilon_{ij} \sim N(0,0.01)$ (crosses) and the estimates $\hat{x}_j$ computed with kernel (21), weight function (22) and bandwidth $b = 1.2$ (solid line). The left panel corresponds to $x_{\theta 1}$, the right to $x_{\theta 2}$.







Figure 2. Kernel $K$ from (21) (left panel) and weight function $w$ from (22) (right panel).



The three required ingredients for the construction of an estimator $\hat\theta_n$ are the kernel $K$, the weight function $w$, and the bandwidth $b$. A general recipe for construction of kernels of an arbitrary order $\alpha$ is given in Section 1.2.2 of Tsybakov (2009) and is based on the use of polynomials that are orthonormal in $L_2(-1,1)$ with weights. In particular, we used the ultraspherical or Gegenbauer polynomials with weight function $v(t) = (1-t^2)^2 1_{[|t|\le 1]}$ and constructed the fourth order kernel with them. Notice that our definition of the kernel of order $\alpha$ in Condition 5 is slightly different from the one in Definition 1.3 of Tsybakov (2009), cf. also the remark on p. 6 there. For ultraspherical polynomials see Section 4.7 in Szegö (1975). Our fourth order kernel took the form

\[ K(t) = \Bigl( \frac{105}{64} - \frac{315}{64}\,t^2 \Bigr)(1-t^2)^2\, 1_{[|t|\le 1]}. \tag{21} \]

Notice that $K$ is a symmetric function. The kernel $K$ is plotted in Figure 2 in the left panel. An alternative here is to use the Gaussian-based kernels as in Wand and Schucany (1990), although they do not have a compact support. As far as the weight function $w$ is concerned, any nonnegative function that is equal to zero close to the end points of the interval $[0,25]$, is equal to one on the greater part of the interval $[0,25]$ and is smooth, could have been used. We opted to simply rescale and shift the function

\[ \lambda_{c,\beta}(t) = \begin{cases} 1, & \text{if } |t| \le c, \\ \exp\bigl[-\beta\exp[-\beta/(|t|-c)^2]/(|t|-1)^2\bigr], & \text{if } c < |t| < 1, \\ 0, & \text{if } |t| \ge 1, \end{cases} \]

that arose in a different context in McMurry and Politis (2004), see formula (3) on p. 552 there, so that it could have the required properties in our context. We took the constants $c$ and $\beta$ to be equal to 0.7 and 0.5, respectively, and then set

\[ w(t) = \lambda_{c,\beta}\Bigl( 1.05\,\frac{t-12.5}{12.5} \Bigr). \tag{22} \]

The function $w$ is plotted in the right panel of Figure 2. Finally, since in the present work construction of the bandwidth selector is not our primary goal, we simply selected $b$ by hand and set it to 1.2.

Figure 3. Derivatives of the solution components $x_{\theta j}$ of the Lotka-Volterra system (19) (thin line) with parameter values $\theta_k = 0.5$, $k = 1,\dots,4$, and initial condition $(x_1(0),x_2(0)) = (1,0.5)$, together with derivative estimates $\hat{x}_j'$ (solid line) computed with kernel (21), weight function (22), and bandwidth $b = 1.2$ using observations $Y_{ij}$ from (20). The left panel corresponds to $x'_{\theta 1}$, the right panel to $x'_{\theta 2}$.
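The kernel and the weight function of this example are easy to code and to sanity-check. The sketch below assumes that (21) reads $K(t) = (105/64 - (315/64)t^2)(1-t^2)^2$ on $[-1,1]$ and that (22) evaluates the flat-top function at $1.05\,(t-12.5)/12.5$, which is how these displays are read here; the function names are hypothetical. The moments of $K$ are computed exactly by polynomial integration.

```python
import numpy as np
from numpy.polynomial import Polynomial


def kernel_K(t):
    """Kernel (21) as read here; identically zero outside [-1, 1]."""
    t = np.asarray(t, dtype=float)
    value = (105.0 - 315.0 * t**2) / 64.0 * (1.0 - t**2) ** 2
    return np.where(np.abs(t) <= 1.0, value, 0.0)


def kernel_moment(j):
    """Exact integral of t^j K(t) over [-1, 1] via a polynomial antiderivative."""
    poly = Polynomial([105.0 / 64.0, 0.0, -315.0 / 64.0]) * Polynomial([1.0, 0.0, -1.0]) ** 2
    antiderivative = (Polynomial([0.0, 1.0]) ** j * poly).integ()
    return antiderivative(1.0) - antiderivative(-1.0)


def flat_top(t, c=0.7, beta=0.5):
    """Flat-top function lambda_{c,beta}; always returns a 1-d array."""
    a = np.abs(np.atleast_1d(np.asarray(t, dtype=float)))
    out = np.zeros_like(a)
    out[a <= c] = 1.0
    mid = (a > c) & (a < 1.0)                 # smooth transition region only
    am = a[mid]
    out[mid] = np.exp(-beta * np.exp(-beta / (am - c) ** 2) / (am - 1.0) ** 2)
    return out


def weight_w(t):
    """Weight function (22) on [0, 25]: rescaled and shifted flat-top function."""
    t = np.asarray(t, dtype=float)
    return flat_top(1.05 * (t - 12.5) / 12.5)
```

Under these assumptions `kernel_moment(0)` equals one, the moments of orders 1, 2 and 3 vanish, and the fourth moment equals $-1/33$, so the first nonvanishing moment is the fourth one. The weight function equals one on the central part of $[0,25]$ and vanishes at both end points because of the factor 1.05.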

The smooth and match estimation procedure was implemented in Mathematica 6.0, see Wolfram Research, Inc. (2007). We first evaluated the kernel estimates of the regression functions $x_{\theta 1}$ and $x_{\theta 2}$ at the equidistant grid of points $s_k = 0.1k$ with $k = 0,\dots,249$. With this number of grid points and the sample size $n = 50$ there was no need to use binning to compute the estimates and moreover, binning would have probably resulted in a slower procedure, cf. Figure 3b in Fan and Marron (1994); so we did not employ it. However, the fact that many of the kernel evaluations $K((s_k - t_i)/b)$ are actually the same, cf. Fan and Marron (1994), was taken into account and led to savings in computation time above the naive implementation of the Priestley-Chao estimator that would directly compute $K((t - t_i)/b)$. The estimates $\hat{x}_1$ and $\hat{x}_2$ are plotted in Figure 1 with a solid line, while the estimates $\hat{x}_1'$ and $\hat{x}_2'$ are plotted in Figure 3. Notice that the estimates $\hat{x}_1'$ and $\hat{x}_2'$ are severely undersmoothed. We next approximated the criterion function $M_{n,w}$ by a Riemann sum



\[ \sum_{k=0}^{249} \bigl( \hat{x}_1'(0.1k) - \eta_1 \hat{x}_1(0.1k) + \eta_2 \hat{x}_1(0.1k)\hat{x}_2(0.1k) \bigr)^2 w(0.1k)\, 0.1 \]






SHOTA GUGUSHVILI AND CHRIS A.J. KLAASSEN 





Figure 4. Solution of the Van der Pol system (23) (thin line) with parameter value $\theta = 0.8$ and initial condition $(x_1(0),x_2(0)) = (1,1)$, observations $Y_{ij}$ given by (24) with $\epsilon_{ij} \sim N(0,0.01)$ (crosses) and the estimates $\hat{x}_j$ computed with kernel (21), weight function (22), and bandwidth $b = 1$ (solid line). The left panel corresponds to $x_{\theta 1}$ and the right to $x_{\theta 2}$.



\[ {} + \sum_{k=0}^{249} \bigl( \hat{x}_2'(0.1k) + \eta_3 \hat{x}_2(0.1k) - \eta_4 \hat{x}_1(0.1k)\hat{x}_2(0.1k) \bigr)^2 w(0.1k)\, 0.1. \]



Note that when performing minimisation, the factor 0.1 can be omitted from both terms in the above display. The minimisation procedure resulted in the estimate

\[ \hat\theta_n = (0.52, 0.50, 0.50, 0.51)^T. \]

With our implementation, the total time needed for computation of the estimate of $\theta$ (including time needed for kernel and weight function evaluations, but excluding time needed for loading observations) was about 0.5 seconds on a notebook with Intel(R) Pentium(R) Dual CPU T3200 @ 2.00 GHz processor and 4.00 GB RAM. The parameter estimates appear to be sufficiently accurate in this particular case.
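The smoothing step of this example can be sketched as follows. The code assumes the usual form of the Priestley-Chao estimator, $\hat\mu_n(t) = \sum_i (t_i - t_{i-1})\, b^{-1} K((t - t_i)/b)\, Y_i$ with $t_0 = 0$, and takes its derivative in $t$ as the estimator of the derivative; the function names are hypothetical and the kernel is (21) in expanded form as read here. It is a direct vectorised evaluation, without the savings from repeated kernel values mentioned above.

```python
import numpy as np


def K(u):
    """Kernel (21) expanded: (105/64)(1 - 5u^2 + 7u^4 - 3u^6) on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    value = 105.0 / 64.0 * (1.0 - 5.0 * u**2 + 7.0 * u**4 - 3.0 * u**6)
    return np.where(np.abs(u) <= 1.0, value, 0.0)


def K_prime(u):
    """Derivative of the kernel; it vanishes at u = -1 and u = 1."""
    u = np.asarray(u, dtype=float)
    value = 105.0 / 64.0 * (-10.0 * u + 28.0 * u**3 - 18.0 * u**5)
    return np.where(np.abs(u) <= 1.0, value, 0.0)


def priestley_chao(s, t_obs, y, b, t0=0.0):
    """Return (estimate, derivative estimate) evaluated at the points s."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    t_obs = np.asarray(t_obs, dtype=float)
    y = np.asarray(y, dtype=float)
    gaps = np.diff(np.concatenate([[t0], t_obs]))    # t_i - t_{i-1}
    u = (s[:, None] - t_obs[None, :]) / b            # shape (len(s), n)
    fit = (K(u) * gaps * y).sum(axis=1) / b
    dfit = (K_prime(u) * gaps * y).sum(axis=1) / b**2
    return fit, dfit


# In the setting of this section one would call, for each component j,
#   priestley_chao(0.1 * np.arange(250), 0.5 * np.arange(1, 51), Y[:, j], b=1.2)
```

On a fine equidistant design and for a linear regression function the estimator and its derivative reproduce the function and its slope at interior points up to a small discretisation error, which gives a quick plausibility check of the implementation.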
Our second example deals with the Van der Pol oscillator that describes an electric circuit containing a nonlinear element, see p. 333, Problem 12 on p. 365, and the references on p. 373 in Edelstein-Keshet (2005). The corresponding system of ordinary differential equations takes the form

\[ \begin{cases} x_1'(t) = \theta^{-1}\bigl( x_1(t) - \tfrac{1}{3}(x_1(t))^3 + x_2(t) \bigr), \\ x_2'(t) = -\theta x_1(t). \end{cases} \tag{23} \]

We took $\theta = 0.8$ and the initial condition $(x_1(0),x_2(0)) = (1,1)$. The solution to (23) is of oscillatory nature and the components $x_{\theta 1}$ and $x_{\theta 2}$ are out of phase with each other. The solution is plotted in Figure 4 with a thin line. We then simulated a data set of size $n = 50$ of observations on the solution $x_\theta$ of (23) over the time interval $[0,25]$ at an equidistant grid of time points $t_i = 0.5i$, $i = 1,\dots,50$, by setting

\[ Y_{ij} = x_{\theta j}(t_i) + \epsilon_{ij}, \quad i = 1,\dots,50,\ j = 1,2, \tag{24} \]

where the i.i.d. measurement errors $\epsilon_{ij}$ were generated from the normal distribution $N(0,\sigma^2)$ with mean zero and variance $\sigma^2 = 0.01$. These observations $Y_{ij}$ are plotted with crosses in Figure 4. When computing the estimate $\hat\theta_n$, we used the same kernel and the same weight function as in the previous example, while the bandwidth was set to $b = 1$. The estimates of the solution components $x_{\theta 1}$ and $x_{\theta 2}$ are depicted by a solid line in Figure 4, while the derivatives $x'_{\theta 1}$ and $x'_{\theta 2}$ together with their estimates are given in Figure 5. The estimation procedure resulted in an estimate $\hat\theta_n = 0.83$ and the computation time was about 0.4 seconds.

Figure 5. Derivatives of the solution components $x_{\theta j}$ of the Van der Pol system (23) (thin line) with parameter value $\theta = 0.8$ and initial condition $(x_1(0),x_2(0)) = (1,1)$, together with derivative estimates $\hat{x}_j'$ (solid line) computed with kernel (21), weight function (22), and bandwidth $b = 1$ using observations $Y_{ij}$ from (24). The left panel corresponds to $x'_{\theta 1}$ and the right panel to $x'_{\theta 2}$.
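In the Van der Pol example the right-hand side is not linear in $\theta$, so the match step is a one-dimensional nonlinear minimisation. A minimal sketch is given below, assuming the system has the form (23) as read here and replacing whatever optimiser was actually used by a plain grid search; the function names and the grid are hypothetical choices.

```python
import numpy as np


def van_der_pol_rhs(x1, x2, eta):
    """Right-hand side of (23) for parameter value eta (vectorised in x1, x2)."""
    return (x1 - x1**3 / 3.0 + x2) / eta, -eta * x1


def criterion(eta, x1, x2, dx1, dx2, w):
    """Discretised criterion: sum_k w_k * ||dx(s_k) - F(x(s_k), eta)||^2."""
    f1, f2 = van_der_pol_rhs(x1, x2, eta)
    return np.sum(((dx1 - f1) ** 2 + (dx2 - f2) ** 2) * w)


def match_step_grid(x1, x2, dx1, dx2, w, grid):
    """Return the grid point minimising the discretised criterion."""
    values = np.array([criterion(eta, x1, x2, dx1, dx2, w) for eta in grid])
    return grid[np.argmin(values)]
```

Here `x1`, `x2`, `dx1`, `dx2` stand for the smoothed curves and derivative estimates on the grid $s_k$ and `w` for the values $w(s_k)$. When the inputs satisfy the model exactly for some parameter value on the grid, the search returns that value; a finer grid or a local optimiser started at the grid minimiser would refine the result.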

We intend to perform a more practically oriented study exploring some of the 
ideas mentioned in this section in a separate publication. 



5. Proofs 

We will use the symbol $\lesssim$, meaning less or equal up to a universal constant independent of index $n$. The symbol $\asymp$ will denote the fact that two sequences of real numbers are asymptotically of the same order.



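Proof of Proposition 1. We first prove (9). For any positive $\varepsilon$ by Chebyshev's inequality we have

\[ P\Bigl( \sup_{t\in[\delta,1-\delta]} |\hat\mu_n(t) - \mu(t)| > \varepsilon \Bigr) \le \frac{2}{\varepsilon^2}\Bigl\{ \sup_{t\in[\delta,1-\delta]} |E[\hat\mu_n(t)] - \mu(t)|^2 + E\Bigl[ \sup_{t\in[\delta,1-\delta]} |\hat\mu_n(t) - E[\hat\mu_n(t)]|^2 \Bigr] \Bigr\} = \frac{2}{\varepsilon^2}(T_1 + T_2). \tag{25} \]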



By p5|) we can write 

E[A„(i)]-/i(i) 



'^(^)l^(^)'^^-'^(*) + ^U 



For all $n$ large enough, we have $b < \delta$, because $b \to 0$. Then for all such $n$, if $t \in [\delta, 1-\delta]$, a standard argument (cf. p. 6 in Tsybakov (2009)), namely Taylor's formula up to order $\alpha$ applied to $\mu$ and the moment conditions on the kernel $K$ formulated in Condition 5, yields



(26) sup |E[A„(i)]-/i(i)l<fe° 

te[<5,i-'5] 



/(")| 



■ / \u''K{u)\du + o(^ 
Next we turn to $T_2$. With argumentation similar to that in the proof of Theorem 1.8 of Tsybakov (2009) and setting



S.it) 



t'i — ti-l T^ I i ^ t-i 
: A 



for j — 1, . . . , N, we have 



A= sup |Ai„(i)-E[/t„(t)]| 
te[s,i-s] 



sup 

t£[S,l-S] 



Y,S,{t)e 



1=1 



< max 



E^»(^ 



N = n^ s, = ^, 



sup 

t,t':\t-t'\<N-^ 



J2iSm-Sdt'))e, 



By the mean value theorem and Condition 1 the inequality



\S^{t)-S.it')\<\\K'\\^-^\t-t'\ 
holds for any $t, t' \in \mathbb{R}$, where $\|K'\|_\infty$ is finite. Hence by the $c_2$-inequality



(27) 



A < max 

\ 1<3<N 



J2 «"^» 



sup 

t,t':\t-t'\<N-^ 



Y^iSdt) ~ Sdt'))e, 



i=l 



< max |Z,r + 
i<i<Ar 



l-ftT' 



'l|2 



l2647V2 



El 



where Zj — X]"=i Si{sj)£i. Notice that 

\ 2 



(28) 



2^,4^; 



■E 



El 

\i=l 



< 



NH^ 



4^4 



nb 



Moreover, we have 



n /I 

E[Z^]=E-'(^.-^.-i)'^if 



n-2||?<'l|2 " 
^ l^\\U-s,\<h 



n 



262 



i=l 



<i_cia2||i^||^max(2,i- 
no V nb 



where the last inequality follows from Condition 1. Since the $Z_j$'s, being a linear combination of independent Gaussian random variables, are themselves Gaussian, Corollary 1.3 of Tsybakov (2009) and the fact that $N = n^2$ then entail



(29) 



E 



max I Z.i I 



O 



logA^ 



O 



logn 



Combining (27), (28) and (29), we obtain



(30) 



;[^2 



o 



logri 
nb 
Taking 



\ nb'^ V nb 



with an appropriate constant $M$ yields (9) by (25), (26), and (30).

As far as the proof of (10) is concerned, it is very much similar to the proof of (9) and is therefore omitted. This completes the proof of the proposition. □

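Proof of Proposition 2. From the definition of $M_{n,w}(\eta)$ and $M_w(\eta)$, the elementary inequality

\[ \bigl| \|a_1\|^2 - \|a_2\|^2 \bigr| \le \|a_1 - a_2\| (\|a_1\| + \|a_2\|) \]

and the Cauchy-Schwarz inequality we have

\[ |M_{n,w}(\eta) - M_w(\eta)| \le \Bigl\{ \int_\delta^{1-\delta} \|\hat{x}'(t) - F(x_\theta(t),\theta) + F(x_\theta(t),\eta) - F(\hat{x}(t),\eta)\|^2 w(t)\,dt \Bigr\}^{1/2} \]
\[ \times \Bigl( \Bigl\{ \int_\delta^{1-\delta} \|\hat{x}'(t) - F(\hat{x}(t),\eta)\|^2 w(t)\,dt \Bigr\}^{1/2} + \Bigl\{ \int_\delta^{1-\delta} \|F(x_\theta(t),\theta) - F(x_\theta(t),\eta)\|^2 w(t)\,dt \Bigr\}^{1/2} \Bigr) = \sqrt{T_1}\,(\sqrt{T_2} + \sqrt{T_3}). \tag{31} \]

For $T_1$ we have that

\[ T_1 \le 2\int_\delta^{1-\delta} \|\hat{x}'(t) - F(x_\theta(t),\theta)\|^2 w(t)\,dt + 2\int_\delta^{1-\delta} \|F(x_\theta(t),\eta) - F(\hat{x}(t),\eta)\|^2 w(t)\,dt. \tag{32} \]

By (12) it holds that

\[ \sup_{\eta\in\Theta} \int_\delta^{1-\delta} \|\hat{x}'(t) - F(x_\theta(t),\theta)\|^2 w(t)\,dt = \int_\delta^{1-\delta} \|\hat{x}'(t) - x'_\theta(t)\|^2 w(t)\,dt \le \sum_{j=1}^d \sup_{t\in[\delta,1-\delta]} |\hat{x}_j'(t) - x'_{\theta j}(t)|^2 \int_\delta^{1-\delta} w(t)\,dt \overset{P}{\to} 0. \tag{33} \]

Moreover, by Lemma 3 from Appendix A we obtain that

\[ \sup_{\eta\in\Theta} \int_\delta^{1-\delta} \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta)\|^2 w(t)\,dt \overset{P}{\to} 0. \tag{34} \]

Furthermore, $T_3 = O_P(1)$ as $n \to \infty$, because

\[ \sup_{\eta\in\Theta} \int_\delta^{1-\delta} \|F(x_\theta(t),\theta) - F(x_\theta(t),\eta)\|^2 w(t)\,dt < \infty \tag{35} \]

by compactness of $\Theta$ and Condition 3, and $T_2 = O_P(1)$, because

\[ \sup_{\eta\in\Theta} \int_\delta^{1-\delta} \|\hat{x}'(t) - F(\hat{x}(t),\eta)\|^2 w(t)\,dt = O_P(1) \tag{36} \]

holds by the inequality

\[ \int_\delta^{1-\delta} \|\hat{x}'(t) - F(\hat{x}(t),\eta)\|^2 w(t)\,dt \lesssim \int_\delta^{1-\delta} \|\hat{x}'(t) - x'_\theta(t)\|^2 w(t)\,dt + \int_\delta^{1-\delta} \|x'_\theta(t) - F(x_\theta(t),\eta)\|^2 w(t)\,dt + \int_\delta^{1-\delta} \|F(x_\theta(t),\eta) - F(\hat{x}(t),\eta)\|^2 w(t)\,dt, \]

Corollary 1, compactness of $\Theta$, Condition 3 and Lemma 3 from Appendix A. Combination of (31)-(36) implies that

\[ \sup_{\eta\in\Theta} |M_{n,w}(\eta) - M_w(\eta)| \overset{P}{\to} 0. \]

The statement of the proposition then follows from this fact, the identifiability condition (14), and Theorem 5.7 of van der Vaart (1998). □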



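Proof of Proposition 3. We interpret the derivative of a one-dimensional function of $\theta$ as a row $p$-vector of partial derivatives and we denote the $d \times p$-matrix of partial derivatives $\partial F_i(x,\theta)/\partial\theta_j$, $i = 1,\dots,d$, $j = 1,\dots,p$, by $F'_\theta(x,\theta)$. We have

\[ \frac{d}{d\theta} \|\hat{x}'(t) - F(\hat{x}(t),\theta)\|^2 = -2\,(\hat{x}'(t) - F(\hat{x}(t),\theta))^T F'_\theta(\hat{x}(t),\theta). \]

With this in mind and interchanging the order of integration and differentiation, we find that the derivative of $M_{n,w}$ from (13) with respect to $\theta$ is given by

\[ -2\int_\delta^{1-\delta} (\hat{x}'(t) - F(\hat{x}(t),\theta))^T F'_\theta(\hat{x}(t),\theta)\, w(t)\,dt. \]

Since $\theta$ is an interior point of $\Theta$, there exists an $\varepsilon > 0$, such that the open ball of radius $\varepsilon$ around $\theta$ is contained in $\Theta$. Take

\[ G_n = \{ \|\hat\theta_n - \theta\| < \varepsilon \} \]

and notice that by consistency of $\hat\theta_n$ we have $P(G_n) \to 1$ as $n \to \infty$. If $\hat\theta_n$ is a point of minimum of $M_{n,w}$, then necessarily

\[ 1_{G_n} \int_\delta^{1-\delta} (\hat{x}'(t) - F(\hat{x}(t),\hat\theta_n))^T F'_\theta(\hat{x}(t),\hat\theta_n)\, w(t)\,dt = 0, \]

where 0 at the righthand side denotes now a row $p$-vector with all its entries equal to zero. The latter display can be rearranged as

\[ 1_{G_n} \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T \bigl\{ (\hat{x}'(t) - x'_\theta(t)) + (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta)) + (F(\hat{x}(t),\theta) - F(\hat{x}(t),\hat\theta_n)) \bigr\} w(t)\,dt = 0, \]

where now 0 on the righthand side denotes a column $p$-vector with its entries equal to zero. Note that we have

\[ F(\hat{x}(t),\theta) - F(\hat{x}(t),\hat\theta_n) = \int_0^1 F'_\theta(\hat{x}(t),\hat\theta_n + \lambda(\theta - \hat\theta_n))\,d\lambda\,(\theta - \hat\theta_n). \]

Hence

\[ 1_{G_n} \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T \int_0^1 F'_\theta(\hat{x}(t),\hat\theta_n + \lambda(\theta - \hat\theta_n))\,d\lambda\, w(t)\,dt\,(\hat\theta_n - \theta) \]
\[ = 1_{G_n} \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt + 1_{G_n} \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\, w(t)\,dt \tag{37} \]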
Js 

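holds. By the fact that $\hat{x}$ converges in probability as a random element on $[\delta, 1-\delta]$ to $x_\theta$, see (11), consistency of $\hat\theta_n$, continuity of $F'_\theta$, continuity of integration and the continuous mapping theorem, see Theorem 18.11 in van der Vaart (1998), we have

\[ \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T \int_0^1 F'_\theta(\hat{x}(t),\hat\theta_n + \lambda(\theta - \hat\theta_n))\,d\lambda\, w(t)\,dt \overset{P}{\to} \int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T F'_\theta(x_\theta(t),\theta)\, w(t)\,dt = J_\theta, \tag{38} \]

where $J_\theta$ is nonsingular by assumption (15). Therefore, (37) shows that the asymptotic behaviour of $\hat\theta_n - \theta$ is given by

\[ J_\theta^{-1} \Bigl\{ \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt + \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\, w(t)\,dt \Bigr\}. \tag{39} \]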

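It thus remains to be shown that this expression in fact reduces to the righthand side of (16). First of all, notice that

\[ \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt \]
\[ = \int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt + \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt \]
\[ = -\int_\delta^{1-\delta} \frac{d}{dt}\bigl[ (F'_\theta(x_\theta(t),\theta))^T w(t) \bigr] (\hat{x}(t) - x_\theta(t))\,dt + \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt, \tag{40} \]

where the last equality follows by integration by parts and the fact that $w(\delta) = w(1-\delta) = 0$. The first term at the righthand side of (40) appears also in the leading term $\Gamma(\hat{x}) - \Gamma(x_\theta)$ of (16). We will now show that the other term at the righthand side of (40) is negligible, i.e.

\[ \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta))^T (\hat{x}'(t) - x'_\theta(t))\, w(t)\,dt = o_P(n^{-1/2}). \]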



By the Cauchy-Schwarz inequality 

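\[
\begin{aligned}
& \Big\| \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta))^T (\hat{x}'(t) - x'_\theta(t))\,w(t)\,dt \Big\| \\
&\quad \le \Big\{ \int_\delta^{1-\delta} \|F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta)\|^2\,w(t)\,dt \Big\}^{1/2} \Big\{ \int_\delta^{1-\delta} \|\hat{x}'(t) - x'_\theta(t)\|^2\,w(t)\,dt \Big\}^{1/2},
\end{aligned}
\]

where $\|\cdot\|$ denotes the Frobenius or the Hilbert-Schmidt norm of a matrix (recall that it is submultiplicative). By (12) we have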

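\[ \Big\{ \int_\delta^{1-\delta} \|\hat{x}'(t) - x'_\theta(t)\|^2\,w(t)\,dt \Big\}^{1/2} = O_P(1)\Big( b^{\alpha-1} + \frac{1}{nb^2} + \sqrt{\frac{\log n}{nb^3}} \Big). \]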



Furthermore, 



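\[
\begin{aligned}
& \int_\delta^{1-\delta} \|F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta)\|^2\,w(t)\,dt \\
&\quad \le 2\int_\delta^{1-\delta} \|F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\hat\theta_n)\|^2\,w(t)\,dt \\
&\qquad + 2\int_\delta^{1-\delta} \|F'_\theta(x_\theta(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta)\|^2\,w(t)\,dt \\
&\quad = 2T_1 + 2T_2.
\end{aligned}
\tag{41}
\]

Denote $F'_\theta(x,\theta) = A(x,\theta) = (a_{ij}(x,\theta))_{ij}$. For $T_1$ we have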



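\[
\begin{aligned}
T_1 &= \sum_{i,j} \int_\delta^{1-\delta} \big( a_{ij}(\hat{x}(t),\hat\theta_n) - a_{ij}(x_\theta(t),\hat\theta_n) \big)^2\,w(t)\,dt \\
&= \sum_{i,j} \int_\delta^{1-\delta} \Big[ \int_0^1 \frac{\partial}{\partial x} a_{ij}(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\hat\theta_n)\,d\lambda\,(\hat{x}(t) - x_\theta(t)) \Big]^2 w(t)\,dt \\
&\le \Big( \sup_{t\in[\delta,1-\delta]} \|\hat{x}(t) - x_\theta(t)\| \Big)^2 \sum_{i,j} \int_\delta^{1-\delta} \int_0^1 \Big\| \frac{\partial}{\partial x} a_{ij}(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\hat\theta_n) \Big\|^2 d\lambda\,w(t)\,dt.
\end{aligned}
\]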



By (11), as well as consistency of $\hat\theta_n$, Condition 2 and the continuous mapping theorem, the righthand side in the last inequality is of order

\[ O_P\Big( \Big( b^{\alpha} + \frac{1}{nb} + \sqrt{\frac{\log n}{nb}} \Big)^2 \Big). \]

By a similar argument, the inequality

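\[
\begin{aligned}
T_2 &= \int_\delta^{1-\delta} \|F'_\theta(x_\theta(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta)\|^2\,w(t)\,dt \\
&\le \|\hat\theta_n - \theta\|^2 \sum_{i,j} \int_\delta^{1-\delta} \int_0^1 \Big\| \frac{\partial}{\partial\theta} a_{ij}(x_\theta(t),\theta + \lambda(\hat\theta_n - \theta)) \Big\|^2 d\lambda\,w(t)\,dt
\end{aligned}
\]

holds. Here with some natural abuse of notation we first differentiate $a_{ij}$ with respect to its second argument $\theta$ and only afterwards evaluate the obtained derivative at $x_\theta(t)$ and $\theta + \lambda(\hat\theta_n - \theta)$. Since the integrals at the righthand side of the above display are bounded in probability, we then get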

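\[ \Big\{ \int_\delta^{1-\delta} \|F'_\theta(x_\theta(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta)\|^2\,w(t)\,dt \Big\}^{1/2} = O_P(\|\hat\theta_n - \theta\|). \tag{42} \]

Now notice that (39) yields

\[
\begin{aligned}
\|\hat\theta_n - \theta\| \le O_P(1) \Big\{ & \Big\| \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (\hat{x}'(t) - x'_\theta(t))\,w(t)\,dt \Big\| \\
& + \Big\| \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\,w(t)\,dt \Big\| \Big\}.
\end{aligned}
\]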

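The Cauchy-Schwarz inequality then gives

\[
\begin{aligned}
\|\hat\theta_n - \theta\| &\le O_P(1) \Big\{ \int_\delta^{1-\delta} \|F'_\theta(\hat{x}(t),\hat\theta_n)\|^2\,w(t)\,dt \Big\}^{1/2} \Big\{ \int_\delta^{1-\delta} \|\hat{x}'(t) - x'_\theta(t)\|^2\,w(t)\,dt \Big\}^{1/2} \\
&\quad + O_P(1) \Big\{ \int_\delta^{1-\delta} \|F'_\theta(\hat{x}(t),\hat\theta_n)\|^2\,w(t)\,dt \Big\}^{1/2} \Big\{ \int_\delta^{1-\delta} \|F(x_\theta(t),\theta) - F(\hat{x}(t),\theta)\|^2\,w(t)\,dt \Big\}^{1/2}.
\end{aligned}
\]

By a by now standard argument, i.e. (11), (12), and the continuous mapping theorem, the righthand side can be further bounded to obtain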



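\[ \|\hat\theta_n - \theta\| \le O_P(1)\Big( b^{\alpha-1} + \frac{1}{nb^2} + \sqrt{\frac{\log n}{nb^3}} + b^{\alpha} + \frac{1}{nb} + \sqrt{\frac{\log n}{nb}} \Big). \tag{43} \]

Summarising the above results, we finally get that the second term at the righthand side of (40) satisfies

\[
\begin{aligned}
& \Big\| \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta))^T (\hat{x}'(t) - x'_\theta(t))\,w(t)\,dt \Big\| \\
&\quad \le O_P(1)\Big( b^{\alpha-1} + \frac{1}{nb^2} + \sqrt{\frac{\log n}{nb^3}} \Big)^2 = o_P(n^{-1/2}),
\end{aligned}
\]

where the last equality follows from our conditions on $b$. Here we also see that the condition $\alpha \ge 3$ is needed for the conclusion to hold.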

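To conclude the proof, it remains to consider the second term within brackets in (39). We have

\[
\begin{aligned}
& \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\,w(t)\,dt \\
&\quad = \int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\,w(t)\,dt \\
&\qquad + \int_\delta^{1-\delta} (F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\,w(t)\,dt.
\end{aligned}
\tag{44}
\]

This can be analysed in a by now routine fashion, but we provide proofs. We first study the first term at the righthand side. By a standard argument we have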



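\[
\begin{aligned}
& \int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T (F(x_\theta(t),\theta) - F(\hat{x}(t),\theta))\,w(t)\,dt \\
&\quad = -\int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T \int_0^1 F'_x(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\theta)\,d\lambda\,(\hat{x}(t) - x_\theta(t))\,w(t)\,dt \\
&\quad = -\int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T F'_x(x_\theta(t),\theta)(\hat{x}(t) - x_\theta(t))\,w(t)\,dt \\
&\qquad - \int_\delta^{1-\delta} (F'_\theta(x_\theta(t),\theta))^T \int_0^1 \big[ F'_x(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\theta) - F'_x(x_\theta(t),\theta) \big]\,d\lambda\,(\hat{x}(t) - x_\theta(t))\,w(t)\,dt \\
&\quad = T_3 + T_4.
\end{aligned}
\]

Recalling (17), we see that $T_3$ appears in the leading term $\Gamma(\hat{x}) - \Gamma(x_\theta)$ in (16) and completes it together with the first term at the righthand side of (40). Next we consider $T_4$. Introduce the notation $F'_x(x,\theta) = B(x,\theta) = (b_{ij}(x,\theta))_{ij}$. We have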



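\[
\begin{aligned}
& \Big\| \int_0^1 \big[ F'_x(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\theta) - F'_x(x_\theta(t),\theta) \big]\,d\lambda\,(\hat{x}(t) - x_\theta(t)) \Big\| \\
&\quad \le \Big( \sup_{t\in[\delta,1-\delta]} \|\hat{x}(t) - x_\theta(t)\| \Big) \int_0^1 \|F'_x(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\theta) - F'_x(x_\theta(t),\theta)\|\,d\lambda \\
&\quad \le \Big( \sup_{t\in[\delta,1-\delta]} \|\hat{x}(t) - x_\theta(t)\| \Big) \int_0^1 \sum_{i,j} |b_{ij}(x_\theta(t) + \lambda(\hat{x}(t) - x_\theta(t)),\theta) - b_{ij}(x_\theta(t),\theta)|\,d\lambda \\
&\quad \le \Big( \sup_{t\in[\delta,1-\delta]} \|\hat{x}(t) - x_\theta(t)\| \Big) \sum_{i,j} \int_0^1 \Big| \int_0^1 \frac{\partial}{\partial x} b_{ij}(x_\theta(t) + \kappa\lambda(\hat{x}(t) - x_\theta(t)),\theta)\,d\kappa\,\lambda(\hat{x}(t) - x_\theta(t)) \Big|\,d\lambda \\
&\quad \le \Big( \sup_{t\in[\delta,1-\delta]} \|\hat{x}(t) - x_\theta(t)\| \Big)^2 \sum_{i,j} \int_0^1 \int_0^1 \Big\| \frac{\partial}{\partial x} b_{ij}(x_\theta(t) + \kappa\lambda(\hat{x}(t) - x_\theta(t)),\theta) \Big\|\,d\kappa\,d\lambda,
\end{aligned}
\]

where in the last inequality we used the fact that $0 \le \lambda \le 1$. Since by convergence in probability of $\hat{x}$ to $x_\theta$, Condition 4 and the continuous mapping theorem the integrals on the righthand side of the above display are bounded in probability, it follows from (11) that $\|T_4\|$ is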
\[ O_P(1)\Big( b^{\alpha} + \frac{1}{nb} + \sqrt{\frac{\log n}{nb}} \Big)^2. \]

This in turn is $o_P(n^{-1/2})$ because of the conditions on $b$. Finally, we treat the second term at the righthand side of (44). By the Cauchy-Schwarz inequality, its norm can be bounded by



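\[ \Big\{ \int_\delta^{1-\delta} \|F'_\theta(\hat{x}(t),\hat\theta_n) - F'_\theta(x_\theta(t),\theta)\|^2\,w(t)\,dt \Big\}^{1/2} \Big\{ \int_\delta^{1-\delta} \|F(x_\theta(t),\theta) - F(\hat{x}(t),\theta)\|^2\,w(t)\,dt \Big\}^{1/2}. \]

Each of the terms at the righthand side has already been treated above, see (41) and (43), and it follows that the expression in the last display is $o_P(n^{-1/2})$. This concludes the proof of Proposition 3. □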

Proof of Proposition 4. By a standard decomposition, we have

\[ E\big[(\Lambda(\hat\mu_n) - \Lambda(\mu))^2\big] = (E[\Lambda(\hat\mu_n)] - \Lambda(\mu))^2 + \mathrm{Var}[\Lambda(\hat\mu_n)] = T_1 + T_2. \]

The statement of the theorem will follow from Chebyshev's inequality, provided we show that the righthand side of the above display is $O(n^{-1})$. For $T_1$ we have



\[ \sqrt{T_1} = \Big| \int v(t)k(t)(E[\hat\mu_n(t)] - \mu(t))\,dt \Big| \le \sup_t |E[\hat\mu_n(t)] - \mu(t)| \int |v(t)k(t)|\,dt = O\Big( b^{\alpha} + \frac{1}{nb} \Big), \]

where the last equality follows from (26). Taking $1/(2\alpha) < \gamma < 1/4$ gives that $T_1$ is $O(n^{-1})$. We next consider $T_2$. By independence of the $\epsilon_i$'s, the fact that $\max_i |t_i - t_{i-1}| \lesssim n^{-1}$, boundedness of $v$ and $k$, and integrability of $K$, we have

\[ T_2 = \mathrm{Var}\Big[ \sum_{i=1}^n (t_i - t_{i-1})\,Y_i \int v(t)k(t)\,\frac{1}{b}K\Big(\frac{t - t_i}{b}\Big)\,dt \Big] = O\Big(\frac{1}{n}\Big). \]

This completes the proof of Proposition 4. □

Proof of Theorem[J\ The result is an easy consequence of Propositions 3 and 4. □
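As an informal illustration of the first-order condition used in the proof of Proposition 3, the following Python sketch applies the smooth and match idea to the scalar equation $x'(t) = -\theta x(t)$. Here $F(x,\eta) = -\eta x$ is linear in the parameter, so the criterion $\int_\delta^{1-\delta} (\hat{x}'(t) + \eta\hat{x}(t))^2 w(t)\,dt$ is minimised in closed form and no numerical integration of the equation is needed. The biweight kernel, the weight function, the uniform design, the noise level, the bandwidth and all function names are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def K(u):
    # Biweight kernel supported on [-1, 1] (an assumed choice for this sketch).
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

def dK(u):
    # Derivative of the biweight kernel.
    return np.where(np.abs(u) <= 1.0, -15.0 / 4.0 * u * (1.0 - u**2), 0.0)

def smooth(t, ti, y, b):
    # Kernel-weighted sums of the observations, of the Riemann-sum form used
    # in Lemma 1, evaluated on the grid t, together with their t-derivative.
    gaps = np.diff(np.concatenate(([0.0], ti)))      # t_i - t_{i-1}, with t_0 = 0
    u = (t[:, None] - ti[None, :]) / b
    x_hat = (K(u) * (gaps * y)).sum(axis=1) / b
    dx_hat = (dK(u) * (gaps * y)).sum(axis=1) / b**2
    return x_hat, dx_hat

def sme_decay_rate(ti, y, b, delta, m=400):
    # Hypothetical helper: smooth and match estimate for x' = -theta * x.
    t = np.linspace(delta, 1.0 - delta, m)
    w = (t - delta) ** 2 * (1.0 - delta - t) ** 2    # w(delta) = w(1 - delta) = 0
    x_hat, dx_hat = smooth(t, ti, y, b)
    # Setting the eta-derivative of the criterion to zero gives a ratio of two
    # integrals; on a uniform grid the grid spacing cancels in the ratio.
    return -np.sum(dx_hat * x_hat * w) / np.sum(x_hat**2 * w)

rng = np.random.default_rng(0)
n, theta, b, delta = 2000, 1.0, 0.1, 0.15            # b <= delta avoids boundary effects
ti = np.arange(1, n + 1) / n
y = np.exp(-theta * ti) + 0.05 * rng.standard_normal(n)
theta_hat = sme_decay_rate(ti, y, b, delta)
print(theta_hat)
```

For models that are nonlinear in the parameter the same criterion would have to be minimised numerically, but the smoothing step would stay unchanged.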



Appendix A 

The proof of Proposition 1 is based on the following two lemmas, which provide integral approximations to the bias and variance of the estimator $\hat\mu_n$ and its derivative $\hat\mu'_n$ at a point $t$.

Lemma 1. Let $\mu$ and $K$ be continuously differentiable and let $K$ be supported on the interval $[-1, 1]$. For any $t \in [0, 1]$

\[ E[\hat\mu_n(t)] = \int_0^1 \mu(s)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big)\,ds + O\Big(\frac{1}{nb}\Big) \tag{45} \]

holds in the regression model (7). The order bound on the remainder term in (45) is uniform in $t \in [0, 1]$.

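Proof. The proof is based on the Riemann sum approximation of the integral. Since $E[\epsilon_i] = 0$, we have

\[ E[\hat\mu_n(t)] = \int_0^1 \mu(s)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big)\,ds + \sum_{i=1}^n (t_i - t_{i-1})\,\mu(t_i)\,\frac{1}{b}K\Big(\frac{t-t_i}{b}\Big) - \int_0^1 \mu(s)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big)\,ds. \]

The first term at the righthand side of this expression is the first term of (45). We will now establish an upper bound on the difference of the other two terms. Using continuous differentiability of $\mu$ and $K$ and the fact that $\max_i |t_i - t_{i-1}| = O(n^{-1})$, we have

\[
\begin{aligned}
& \Big| \sum_{i=1}^n (t_i - t_{i-1})\,\mu(t_i)\,\frac{1}{b}K\Big(\frac{t-t_i}{b}\Big) - \int_0^1 \mu(s)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big)\,ds \Big| \\
&\quad \le \sum_{i=1}^n \int_{t_{i-1}}^{t_i} \Big| \mu(t_i)\,\frac{1}{b}K\Big(\frac{t-t_i}{b}\Big) - \mu(t_i)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big) \Big|\,ds \\
&\qquad + \sum_{i=1}^n \int_{t_{i-1}}^{t_i} \Big| \mu(t_i)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big) - \mu(s)\,\frac{1}{b}K\Big(\frac{t-s}{b}\Big) \Big|\,ds \\
&\quad \lesssim \frac{1}{nb}\,\|\mu\|_\infty \|K'\|_\infty + \frac{1}{n}\,\|\mu'\|_\infty \|K\|_\infty,
\end{aligned}
\]

which is of order $n^{-1}b^{-1}$. This establishes (45). □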
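The Riemann sum approximation behind Lemma 1 can be checked numerically. The sketch below compares the sum $\sum_i (t_i - t_{i-1})\mu(t_i)b^{-1}K((t-t_i)/b)$ with the corresponding integral for a uniform design and reports the gap multiplied by $nb$, which should stay bounded if the remainder is of order $1/(nb)$. The regression function `mu`, the biweight kernel, the evaluation point and the bandwidth are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def K(u):
    # Biweight kernel: continuously differentiable and supported on [-1, 1].
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

def mu(s):
    # Hypothetical regression function, used only for illustration.
    return 1.0 + s**2

def riemann_sum(t, n, b):
    # sum_i (t_i - t_{i-1}) mu(t_i) K((t - t_i)/b) / b with t_i = i/n and t_0 = 0.
    ti = np.arange(1, n + 1) / n
    return np.sum(mu(ti) * K((t - ti) / b)) / (n * b)

def integral(t, b, m=200_001):
    # Fine-grid trapezoidal approximation of int_0^1 mu(s) K((t - s)/b) / b ds.
    s = np.linspace(0.0, 1.0, m)
    f = mu(s) * K((t - s) / b) / b
    return np.sum((f[1:] + f[:-1]) / 2.0) * (s[1] - s[0])

t, b = 0.05, 0.1          # a point near the boundary, where the kernel is cut off at 0
scaled_gaps = []
for n in (100, 200, 400, 800):
    gap = abs(riemann_sum(t, n, b) - integral(t, b))
    scaled_gaps.append(n * b * gap)
print(scaled_gaps)        # roughly constant in n if the gap is of order 1/(n b)
```

At interior points, where the kernel window lies inside $[0,1]$, the gap for a uniform design is typically much smaller than the bound suggests; the boundary point is used here because the $1/(nb)$ order is visible there.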



The second lemma can be proved along the same lines as the previous one and therefore we omit its proof. The existence of the second derivative of $K$ is needed in the proof of this lemma.

Lemma 2. Let $\mu$ be continuously differentiable and let $K$ be twice continuously differentiable and be supported on the interval $[-1, 1]$. For all $t \in [0, 1]$

\[ E[\hat\mu'_n(t)] = \int_0^1 \mu(s)\,\frac{1}{b^2}K'\Big(\frac{t-s}{b}\Big)\,ds + O\Big(\frac{1}{nb^2}\Big) \tag{46} \]

holds in the regression model (7). Furthermore, if $b < \delta$ and $t \in [\delta, 1-\delta]$, then integration by parts yields

\[ E[\hat\mu'_n(t)] = \int_{-1}^{1} \mu'(t - bu)K(u)\,du + O\Big(\frac{1}{nb^2}\Big). \tag{47} \]

The order bounds on the remainder terms in (46) and (47) are uniform in $t$.

The following lemma is used in the proof of Proposition 2.

Lemma 3. Let the stochastic process $X_n = (X_{n,\eta})_{\eta\in\Theta}$ be defined as

\[ (X_{n,\eta})_{\eta\in\Theta} = \Big( \int_\delta^{1-\delta} \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta)\|^2\,w(t)\,dt \Big)_{\eta\in\Theta}. \]

Then under the conditions of Proposition 2 we have $X_n \xrightarrow{P} 0$, where $0$ at the righthand side denotes the zero process on $\Theta$ and convergence is understood as convergence for random elements with values in the space $C(\Theta)$ of continuous functions on $\Theta$, which is equipped with the supremum norm.

Proof. To prove the lemma, we will verify the conditions of Theorem 18.14 of van der Vaart (1998). By (11) and the continuous mapping theorem, see Theorem 18.11 in van der Vaart (1998), for every fixed $\eta$ it holds that

\[ \int_\delta^{1-\delta} \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta)\|^2\,w(t)\,dt \xrightarrow{P} 0. \tag{48} \]

Consequently, for any positive integer $k$ and any $\eta_1, \ldots, \eta_k \in \Theta$ we have

\[ (X_{n,\eta_1}, \ldots, X_{n,\eta_k}) \xrightarrow{P} (0, \ldots, 0) \]

and hence condition (i) of Theorem 18.14 in van der Vaart (1998) is satisfied. Introduce

\[ G = \bigcap_{j=1}^d \Big\{ \sup_{t\in[\delta,1-\delta]} |\hat{x}_j(t) - x_{\theta,j}(t)| \le \beta \Big\} \]

and notice

\[ G^c = \bigcup_{j=1}^d \Big\{ \sup_{t\in[\delta,1-\delta]} |\hat{x}_j(t) - x_{\theta,j}(t)| > \beta \Big\}. \]

For any positive $\varepsilon$ and $\beta$ and any partition $\Theta_1, \ldots, \Theta_m$ of $\Theta$ we have

\[ P\Big( \sup_{\ell} \sup_{\eta,\zeta\in\Theta_\ell} |X_{n,\eta} - X_{n,\zeta}| > \varepsilon \Big) \le P\Big( \sup_{\ell} \sup_{\eta,\zeta\in\Theta_\ell} |X_{n,\eta} - X_{n,\zeta}| > \varepsilon;\ G \Big) + P(G^c). \tag{49} \]

By (11) we know that

\[ \lim_{n\to\infty} P(G^c) \le \lim_{n\to\infty} \sum_{j=1}^d P\Big( \sup_{t\in[\delta,1-\delta]} |\hat{x}_j(t) - x_{\theta,j}(t)| > \beta \Big) = 0. \tag{50} \]

We will now show that for arbitrarily small positive $\rho$ and $\varepsilon$ there exists a partition $\Theta_1, \ldots, \Theta_m$ of $\Theta$, such that

\[ \limsup_{n\to\infty} P\Big( \sup_{\ell} \sup_{\eta,\zeta\in\Theta_\ell} |X_{n,\eta} - X_{n,\zeta}| > \varepsilon;\ G \Big) \le \rho. \]

Together with (49) and (50) this will imply condition (ii) of Theorem 18.14 in van der Vaart (1998) and hence also the fact that $X_n$ converges weakly to zero. The statement of the lemma will then be a simple consequence of the fact that convergence in distribution and in probability are equivalent for constants, see Theorem 18.10 of van der Vaart (1998).
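Notice that

\[
\begin{aligned}
|X_{n,\eta} - X_{n,\zeta}| &\le \int_\delta^{1-\delta} \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta) - F(\hat{x}(t),\zeta) + F(x_\theta(t),\zeta)\| \\
&\qquad \times \big( \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta)\| + \|F(\hat{x}(t),\zeta) - F(x_\theta(t),\zeta)\| \big)\,w(t)\,dt \\
&\le \Big\{ \int_\delta^{1-\delta} \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta) - F(\hat{x}(t),\zeta) + F(x_\theta(t),\zeta)\|^2\,w(t)\,dt \Big\}^{1/2} \\
&\qquad \times \Big\{ \int_\delta^{1-\delta} \big( \|F(\hat{x}(t),\eta) - F(x_\theta(t),\eta)\| + \|F(\hat{x}(t),\zeta) - F(x_\theta(t),\zeta)\| \big)^2\,w(t)\,dt \Big\}^{1/2} \\
&= \sqrt{T_3}\,\sqrt{T_4}.
\end{aligned}
\]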

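For $T_3$ we have

\[ T_3 \le 2\int_\delta^{1-\delta} \|F(\hat{x}(t),\eta) - F(\hat{x}(t),\zeta)\|^2\,w(t)\,dt + 2\int_\delta^{1-\delta} \|F(x_\theta(t),\eta) - F(x_\theta(t),\zeta)\|^2\,w(t)\,dt. \]

Restricting $\omega$'s from the sample space $\Omega$ to the set $G$, we get

\[
\begin{aligned}
T_3 &\le 2\int_\delta^{1-\delta} \int_0^1 \|F'_\theta(\hat{x}(t),\zeta + \lambda(\eta - \zeta))\|^2\,d\lambda\,\|\eta - \zeta\|^2\,w(t)\,dt \\
&\quad + 2\int_\delta^{1-\delta} \int_0^1 \|F'_\theta(x_\theta(t),\zeta + \lambda(\eta - \zeta))\|^2\,d\lambda\,\|\eta - \zeta\|^2\,w(t)\,dt \\
&\le 4\|\eta - \zeta\|^2 \int_\delta^{1-\delta} w(t)\,dt \sup_{\substack{|x_j| \le \|x_{\theta,j}\|_\infty + \beta,\ j=1,\ldots,d \\ \nu\in\Theta}} \|F'_\theta(x,\nu)\|^2 = C(\beta,w,\theta,\Theta)\,\|\eta - \zeta\|^2
\end{aligned}
\]

on the set $G$. Notice that $C(\beta,w,\theta,\Theta)$ is a finite constant, because $\|F'_\theta(x,\nu)\|$ is continuous and its supremum is taken over a compact set. By similar techniques one can show that $T_4 \le C'(\beta,w,\theta,\Theta)$ for some constant $C'(\beta,w,\theta,\Theta)$ which depends only on $\beta$, $w$, $\theta$ and $\Theta$. Consequently,

\[
\begin{aligned}
& P\Big( \sup_{\ell} \sup_{\eta,\zeta\in\Theta_\ell} |X_{n,\eta} - X_{n,\zeta}| > \varepsilon;\ G \Big) \\
&\quad \le P\Big( \sup_{\ell} \sup_{\eta,\zeta\in\Theta_\ell} \sqrt{C(\beta,w,\theta,\Theta)\,C'(\beta,w,\theta,\Theta)}\,\|\eta - \zeta\| > \varepsilon \Big).
\end{aligned}
\tag{51}
\]

Now take a partition $\Theta_1, \ldots, \Theta_m$ of $\Theta$ such that for all $\ell = 1, \ldots, m$

\[ \operatorname{diam}\Theta_\ell < \frac{\varepsilon}{\sqrt{C(\beta,w,\theta,\Theta)\,C'(\beta,w,\theta,\Theta)}} \]

holds, where $\operatorname{diam}\Theta_\ell$ denotes the diameter of the set $\Theta_\ell$. Observe that since $\Theta \subset \mathbb{R}^p$ is compact, there indeed exists a finite $m$ for which this is satisfied. The righthand side of (51) for such a partition is zero and consequently the conditions (i) and (ii) of Theorem 18.14 of van der Vaart (1998) hold. This completes the proof of the lemma. □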



Appendix B 

Here we state and prove a modification of Proposition 1 for the case when the $\epsilon_i$'s are bounded.

Proposition 5. In the regression model (7) replace the assumption of Gaussianity of the $\epsilon_i$'s by $|\epsilon_i| \le C$ for some constant $C > 0$ and suppose Condition 1 holds.
(i) If $\mu$ is $\alpha \ge 1$ times continuously differentiable and $b \to 0$ as $n \to \infty$, then

\[ \sup_{t\in[\delta,1-\delta]} |\hat\mu_n(t) - \mu(t)| = O_P\Big( b^{\alpha} + \frac{1}{nb} + \sqrt{\frac{\log n}{nb}} \Big). \tag{52} \]

(ii) If $\mu$ is $\alpha \ge 2$ times continuously differentiable and $b \to 0$ as $n \to \infty$, then

\[ \sup_{t\in[\delta,1-\delta]} |\hat\mu'_n(t) - \mu'(t)| = O_P\Big( b^{\alpha-1} + \frac{1}{nb^2} + \sqrt{\frac{\log n}{nb^3}} \Big) \tag{53} \]

is valid. Moreover, $\hat\mu_n$ and $\hat\mu'_n$ are consistent on $[\delta, 1-\delta]$, if $nb^3/\log n \to \infty$ holds additionally.

Proof. The proof of (52) follows the same steps as the proof of (11). The only difference is that we need to show that

\[ E\Big[ \max_{1\le j\le N} |Z_j| \Big] = O\Big( \sqrt{\frac{\log n}{nb}} \Big) \tag{54} \]

holds also for bounded $\epsilon_i$'s and not only for the Gaussian $\epsilon_i$'s. To this end we will use some results from Chapter 2.2 of Wellner and van der Vaart (1996). Let $\eta$ be a nondecreasing and convex function on $[0, \infty)$, such that $\eta(0) = 0$. The Orlicz norm $\|X\|_\eta$ of a random variable $X$ is defined as

\[ \|X\|_\eta = \inf\Big\{ C > 0 : E\Big[ \eta\Big( \frac{|X|}{C} \Big) \Big] \le 1 \Big\}. \]

A particular $\eta$ that we will use is $\eta(x) = \exp(x^2) - 1$. Since the $\epsilon_i$'s have mean zero and are bounded, for any $x > 0$ Hoeffding's inequality, see Theorem 2 in Hoeffding (1963), implies

P(|Zj| >x)< 2exp j -2x7 (5ZC'75',(sj))2 
By Condition 1



i=l 1=1 

1 „o„..„n /_ 1 \ 1 



< ^-C ||i^||i,cimax 2, max 



nb °° \ ^ n nbj C^nh 

holds. Thus the inequality

\[ P(|Z_j| > x) \le 2\exp(-2C_0 nb x^2) \]

is valid. By Lemma 2.2.1 of Wellner and van der Vaart (1996) it then follows that

\[ \max_j \|Z_j\|_\eta \le \frac{C_1}{\sqrt{nb}}, \tag{55} \]

where $C_1$ depends on $C_0$ only. Let $\|X\|_2$ denote the $L_2$ norm of a random variable $X$, i.e. $\|X\|_2 = \sqrt{E[X^2]}$. Notice that the inequality

\[ \|X\|_2 \le \|X\|_\eta \tag{56} \]

holds, because of $\eta(x) \ge x^2$. The inequalities (55) and (56) combined with Lemma 2.2.2 of Wellner and van der Vaart (1996) yield that

\[ E\Big[ \max_{1\le j\le N} |Z_j| \Big] \le C_3 \sqrt{\frac{\log(1+N)}{nb}}, \]

where the constant $C_3$ is independent of $N$. Now notice that for $N > 4$



Hence (54) holds and this completes the proof of (52). Formula (53) can be proved in a similar fashion. □
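The role of Hoeffding's inequality in this proof can be illustrated by simulation. The sketch below forms a kernel-weighted sum of bounded, mean-zero errors at a single point, compares its empirical tail probabilities with the Hoeffding bound, and checks that the sum of squared weights is of order $1/(nb)$, which is what produces a tail bound of the form $2\exp(-cnbx^2)$. The uniform error law, the biweight kernel, the evaluation point `s` and all numerical values are assumptions made for the illustration and are not taken from the paper.

```python
import numpy as np

def K(u):
    # Biweight kernel supported on [-1, 1] (an assumed choice).
    return np.where(np.abs(u) <= 1.0, 15.0 / 16.0 * (1.0 - u**2) ** 2, 0.0)

rng = np.random.default_rng(1)
n, b, s, C = 200, 0.1, 0.5, 1.0
ti = np.arange(1, n + 1) / n
weights = K((s - ti) / b) / (n * b)      # kernel weights of the sum at the point s

# Bounded, mean-zero errors: Uniform(-C, C) (an assumed error law).
eps = rng.uniform(-C, C, size=(20_000, n))
Z = eps @ weights                        # 20,000 replications of the weighted sum

def hoeffding_bound(x):
    # Each summand weights[i] * eps[i] lies in an interval of length 2 * C * weights[i].
    return 2.0 * np.exp(-2.0 * x**2 / np.sum((2.0 * C * weights) ** 2))

scale = C * np.sqrt(np.sum(weights**2))
levels = (0.5, 1.0, 1.5, 2.0, 3.0)
empirical = [float(np.mean(np.abs(Z) > k * scale)) for k in levels]
bounds = [float(hoeffding_bound(k * scale)) for k in levels]
for k, e, h in zip(levels, empirical, bounds):
    print(k, e, h)

# n * b * sum of squared weights stays bounded, so the bound behaves like 2 exp(-c n b x^2).
nb_sum_sq = n * b * np.sum(weights**2)
print(nb_sum_sq)
```

The Hoeffding bound is conservative here because it uses only the range of the errors and not their variance.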